Pencil-based algorithms for tensor rank decomposition are not stable

We prove the existence of an open set of $n_1\times n_2 \times n_3$ tensors of rank $r$ on which a popular and efficient class of algorithms for computing tensor rank decompositions based on a reduction to a linear matrix pencil, typically followed by a generalized eigendecomposition, is arbitrarily numerically forward unstable. Our analysis shows that this problem is caused by the fact that the condition number of the tensor rank decomposition can be much larger for $n_1 \times n_2 \times 2$ tensors than for the $n_1\times n_2 \times n_3$ input tensor. Moreover, we present a lower bound for the limiting distribution of the condition number of random tensor rank decompositions of third-order tensors. The numerical experiments illustrate that for random tensor rank decompositions one should anticipate a loss of precision of a few digits.


Introduction
We study the numerical stability of one of the most popular and effective classes of algorithms for computing the tensor rank decomposition, or canonical polyadic decomposition (CPD), of a tensor. Recall that a rank-1 tensor is represented by a multidimensional $n_1 \times n_2 \times \cdots \times n_d$ array $\mathcal{B} = (b_{i_1,i_2,\ldots,i_d})_{1\le i_1\le n_1,\ldots,1\le i_d\le n_d}$ whose elements satisfy the following property: there exist vectors $b^k \in R^{n_k}$ such that $b_{i_1,i_2,\ldots,i_d} = b^1_{i_1} b^2_{i_2} \cdots b^d_{i_d}$. For brevity, one writes $\mathcal{B} = b^1 \otimes b^2 \otimes \cdots \otimes b^d$. The CPD of $\mathcal{A} \in R^{n_1\times\cdots\times n_d}$ was proposed by Hitchcock [26]. It expresses $\mathcal{A}$ as a minimum-length linear combination of rank-1 tensors:
(1.1) $\mathcal{A} = \mathcal{A}_1 + \mathcal{A}_2 + \cdots + \mathcal{A}_r$, where $\mathcal{A}_i = a^1_i \otimes a^2_i \otimes \cdots \otimes a^d_i$ and $a^k_i \in R^{n_k}$
for all $i = 1, \ldots, r$ and $k = 1, \ldots, d$. The number $r$ in (1.1) is called the rank and $d$ is the order of $\mathcal{A}$. It is often convenient to consider the factor matrices $A_1, \ldots, A_d$, where $A_k := [a^k_i]_{i=1}^r$. Mainly due to its simplicity and uniqueness properties [12,30], the CPD has found application in a diverse set of scientific fields; see [8,14,15,28,29,39,40]. A rank-$r$ tensor $\mathcal{A}$ is called $r$-identifiable if the set of rank-1 tensors $\{\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_r\}$ whose sum is $\mathcal{A}$, as in (1.1), is uniquely determined given $\mathcal{A}$. A classic result on $r$-identifiability is Kruskal's criterion [30]. It is formulated in terms of the Kruskal rank $k_M$ of a matrix $M$: $k_M$ is the largest integer $k$ such that every subset of $k$ columns of $M$ has rank equal to $k$.
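A CPD and the entrywise rank-1 property are straightforward to verify numerically. The following minimal numpy sketch (our own illustration; the dimensions and variable names are arbitrary choices) builds a rank-1 tensor and a small CPD as in (1.1):

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-1 tensor B = b1 ⊗ b2 ⊗ b3 of format 2 x 3 x 4.
b1, b2, b3 = rng.standard_normal(2), rng.standard_normal(3), rng.standard_normal(4)
B = np.einsum('i,j,k->ijk', b1, b2, b3)
# Entrywise, b_{i1,i2,i3} = b1_{i1} * b2_{i2} * b3_{i3}.
assert np.isclose(B[1, 2, 3], b1[1] * b2[2] * b3[3])

# A CPD as in (1.1): a sum of r = 3 rank-1 terms, stored as factor matrices.
A1, A2, A3 = (rng.standard_normal((n, 3)) for n in (2, 3, 4))
T = sum(np.einsum('i,j,k->ijk', A1[:, i], A2[:, i], A3[:, i]) for i in range(3))
# Equivalently, contract all factor matrices in one shot.
assert np.allclose(T, np.einsum('ir,jr,kr->ijk', A1, A2, A3))
```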
Most low-rank tensors satisfy Kruskal's criterion; more precisely, there is an open dense subset of the set of rank-$r$ tensors in $R^{n_1\times n_2\times n_3}$, $n_1 \ge n_2 \ge n_3 \ge 2$, where $r$-identifiability holds, provided that $r \le n_1 + \min\{\frac{1}{2}\delta, \delta\}$ with $\delta := n_2 + n_3 - n_1 - 2$. The computational problem of recovering the set of rank-1 tensors $\{\mathcal{A}_1, \ldots, \mathcal{A}_r\}$ whose sum is $\mathcal{A}$ is called the tensor rank decomposition problem (TDP). When the rank of a third-order tensor is sufficiently small, there are efficient direct numerical algorithms for solving the TDP, such as those in [18-20, 33, 34, 37, 38]. All of these algorithms involve the computation of a generalized eigendecomposition (GEVD) of a linear matrix pencil constructed from the low-rank input tensor. An algorithm for solving TDPs that involves such a reduction to a matrix pencil will subsequently be called a pencil-based algorithm (PBA). This will be given a precise meaning in Definition 5.1, where we rigorously define the class of PBAs.
A prototypical example of a PBA is presented next. The essential idea is to project a given tensor $\mathcal{A} \in R^{n_1\times n_2\times n_3}$, $n_1 \ge n_2 \ge r$, to a tensor of format $n_1 \times n_2 \times 2$ and recover the first factor matrix from the latter. The input $\mathcal{A} = \sum_{i=1}^r a_i \otimes b_i \otimes c_i$ is assumed to admit a unique rank-$r$ CPD with $\|a_i\| = 1$ for all $i = 1, \ldots, r$. Let $Q \in R^{n_3\times 2}$ be a matrix with orthonormal columns. Then, contracting $\mathcal{A}$ along the third mode by $Q^T$, which is a special type of multilinear multiplication [17,28], yields the tensor
$$ \mathcal{B} := (I_{n_1}, I_{n_2}, Q^T)\cdot\mathcal{A} = \sum_{i=1}^r a_i \otimes b_i \otimes (Q^T c_i) \in R^{n_1\times n_2\times 2}, $$
where $I_m$ denotes the $m \times m$ identity matrix. Let $Q_1 \in R^{n_1\times r}$, respectively $Q_2 \in R^{n_2\times r}$, be a matrix whose orthonormal columns form a basis for $\mathrm{span}\{a_i\}$, respectively $\mathrm{span}\{b_i\}$, and write $a_i = Q_1 x_i$ and $b_i = Q_2 y_i$. The following is then a specific orthogonal Tucker decomposition [42] of $\mathcal{B}$:
$$ \mathcal{B} = (Q_1, Q_2, I_2)\cdot\mathcal{S}. $$
Then it follows from the properties of multilinear multiplication that the core tensor $\mathcal{S} = (Q_1^T, Q_2^T, I_2)\cdot\mathcal{B} \in R^{r\times r\times 2}$ has the following two 3-slices:
$$ S_1 = X\,\mathrm{diag}(\lambda_1)\,Y^T \quad\text{and}\quad S_2 = X\,\mathrm{diag}(\lambda_2)\,Y^T, $$
where $X := [x_i]_{i=1}^r$, $Y := [y_i]_{i=1}^r$, and $\lambda_k \in R^r$ collects the $k$th entries of the vectors $Q^T c_i$. Whenever $S_1$ and $S_2$ are nonsingular, we have $S_1 S_2^{-1} = X\,\mathrm{diag}(\lambda_1)\,\mathrm{diag}(\lambda_2)^{-1} X^{-1}$; thus $X$ is the matrix of eigenvectors of the GEVD of the nonsingular matrix pencil $(S_1, S_2)$. As long as the eigenvalues are distinct, the matrix $X$ is uniquely determined up to the scaling and permutation of its columns, and it follows that $A = Q_1 X$ is the first factor matrix, where $A := [a_i]_{i=1}^r$. Finally, the rank-1 tensors $\mathcal{A}_i = a_i \otimes b_i \otimes c_i$ are recovered by the following well-known property [28,39] of the 1-flattening: $\mathcal{A}_{(1)} = A(B \odot C)^T$, where $M \odot N := [m_i \otimes n_i]_{i=1}^r \in R^{mn\times r}$ is the Khatri-Rao product of $M \in R^{m\times r}$ and $N \in R^{n\times r}$, and $B := [b_i]_{i=1}^r$, $C := [c_i]_{i=1}^r$. Then, we see that
$$ (B \odot C)^T = X^{\dagger} Q_1^T \mathcal{A}_{(1)}, $$
where $X^{\dagger}$ is the Moore-Penrose pseudoinverse of $X$. This procedure thus solves the TDP.
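The prototypical PBA above can be condensed into a few lines of numpy. The sketch below is an illustrative implementation under generic assumptions (random factors, distinct eigenvalues, $n_1 \ge n_2 \ge r$); it is not the paper's reference code, and the dimensions are arbitrary choices:

```python
import numpy as np
from scipy.linalg import orth

rng = np.random.default_rng(0)
n1, n2, n3, r = 6, 5, 4, 3

# A random rank-r tensor T = sum_i a_i ⊗ b_i ⊗ c_i (generically r-identifiable).
A = rng.standard_normal((n1, r))
B = rng.standard_normal((n2, r))
C = rng.standard_normal((n3, r))
T = np.einsum('ir,jr,kr->ijk', A, B, C)

# Step 1: contract the third mode with an orthonormal Q in R^{n3 x 2}.
Q = orth(rng.standard_normal((n3, 2)))
P = np.einsum('ijk,kl->ijl', T, Q)                       # n1 x n2 x 2 pencil

# Step 2: orthonormal bases Q1, Q2 of the mode-1 and mode-2 column spans.
Q1 = np.linalg.svd(P.reshape(n1, -1))[0][:, :r]
Q2 = np.linalg.svd(P.transpose(1, 0, 2).reshape(n2, -1))[0][:, :r]

# Step 3: core pencil (S1, S2); the eigenvectors of S1 S2^{-1} give X.
S1, S2 = (Q1.T @ P[:, :, k] @ Q2 for k in (0, 1))
_, X = np.linalg.eig(S1 @ np.linalg.inv(S2))
A_hat = Q1 @ np.real(X)                                  # first factor, up to scale/order

# Step 4: least-squares recovery of the Khatri-Rao factor from the 1-flattening.
M = np.linalg.pinv(A_hat) @ T.reshape(n1, -1)            # rows ~ vec(b_i ⊗ c_i)

# Sanity checks: each recovered row is a rank-1 matrix, and the terms sum to T.
for i in range(r):
    s = np.linalg.svd(M[i].reshape(n2, n3), compute_uv=False)
    assert s[1] < 1e-6 * s[0]
assert np.allclose((A_hat @ M).reshape(n1, n2, n3), T)
```

The scaling indeterminacy in the eigenvectors is harmless here: it cancels between `A_hat` and the least-squares factor `M`, so the rank-1 terms are recovered exactly.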
The above algorithm and those in [18-20, 33, 34, 37, 38] have the major advantage that the CPD can be computed via a sequence of numerically stable and efficient linear algebra algorithms for solving classic problems such as linear systems, linear least-squares problems, and generalized eigendecomposition problems. In light of the plentiful indications that computing a CPD is a difficult problem (the NP-completeness of tensor rank [27], the ill-posedness of the corresponding approximation problem [17], and the potential (average) ill-conditioning of the TDP [4,5]), the existence of the aforementioned algorithms is almost too good to be true. We show that there is a price to be paid in the currency of the achievable precision by establishing the following result.

Theorem 1.2. Let $n_1 \ge n_2 \ge n_3 > r + 1 \ge 2$. For every pencil-based algorithm, there exists an open set of rank-$r$ tensors in $R^{n_1\times n_2\times n_3}$ on which it is unstable.
In practice, Theorem 1.2 covers the algorithms from [20,33,34,37,38], cpd_gevd from Tensorlab v3.0 [45], [18, Algorithm 2], and the foregoing prototypical PBA. Algorithm 1 of [18], as well as both algorithms in [19], are likely also unstable because they use an unstable algorithm in intermediate steps; a more thorough analysis would be required to show this rigorously.

Remark 1.3. For higher-order tensors $\mathcal{A} \in R^{n_1\times\cdots\times n_d}$ with $d \ge 4$ it is common practice to reshape them into a third-order tensor $\mathcal{A}_{(j,k,l)} \in R^{m_1\times m_2\times m_3}$ by choosing a partition of the indices $\{1,\ldots,d\} = \{j_1,\ldots,j_s\} \sqcup \{k_1,\ldots,k_t\} \sqcup \{l_1,\ldots,l_u\}$ with $m_1 = n_{j_1}\cdots n_{j_s}$, $m_2 = n_{k_1}\cdots n_{k_t}$, and $m_3 = n_{l_1}\cdots n_{l_u}$. Under the conditions of section 7 of [13], the CPD of $\mathcal{A}_{(j,k,l)}$, i.e., the set of rank-1 tensors, can be reshaped back into a set of rank-1 order-$d$ tensors in $R^{n_1\times\cdots\times n_d}$, yielding the CPD of $\mathcal{A}$. According to Theorem 1.2 this strategy employs an unstable algorithm as an intermediate step, so we should a priori expect that the resulting algorithm is also unstable. This can be proved rigorously for $u = |l| = 1$ by a slight generalization of the argument in Section 6. We leave a general proof as an open question.
It is important to mention that the stability of the algorithms employed in the intermediate steps of a PBA is not the reason why PBAs are unstable. In the above prototypical PBA, all individual steps can be implemented using numerically stable algorithms, but the resulting algorithm is nevertheless unstable. The instability in Theorem 1.2 is instead caused by a large difference between the condition numbers of the TDPs in $R^{n_1\times n_2\times n_3}$ and $R^{n_1\times n_2\times 2}$.
The condition number of the TDP was studied in [4]. Let us denote the set of $n_1 \times \cdots \times n_d$ tensors of rank 1 by $\mathcal{S}$. This set is actually a smooth manifold, called the Segre manifold; see Section 4.1. Tensors of rank at most $r$ are obtained as the image of the addition map
$$ \Phi_r : \mathcal{S}^{\times r} \to R^{n_1\times\cdots\times n_d},\quad (\mathcal{A}_1,\ldots,\mathcal{A}_r) \mapsto \mathcal{A}_1 + \cdots + \mathcal{A}_r. $$
The condition number of the TDP at a rank-$r$ tensor $\mathcal{A}$ with ordered CPD $a = (\mathcal{A}_1,\ldots,\mathcal{A}_r)$ is
(1.2) $\displaystyle\kappa(\mathcal{A}_1,\ldots,\mathcal{A}_r) := \lim_{\epsilon\to 0}\ \sup_{\|\mathcal{A}-\mathcal{B}\|\le\epsilon}\ \frac{\|\Phi_a^{-1}(\mathcal{B}) - a\|}{\|\mathcal{A}-\mathcal{B}\|},$
where $\Phi_a^{-1}$ is a local inverse function of $\Phi_r$ at $a = (\mathcal{A}_1,\ldots,\mathcal{A}_r)$; see [4]. The norms are the Euclidean norms on the ambient spaces of the domain and image of $\Phi_r$, which are naturally identified with the Frobenius norms of tensors, i.e., the square root of the sum of squares of the elements. It follows from the spectral characterization in [4, Theorem 1.1] that $\kappa$ depends only on the (unordered) CPD $\{\mathcal{A}_1,\ldots,\mathcal{A}_r\}$; therefore, we often write $\kappa(\mathcal{A}_1,\ldots,\mathcal{A}_r)$ for the condition number. If such a local inverse does not exist, we set $\kappa(\mathcal{A}_1,\ldots,\mathcal{A}_r) := \infty$. In Section 4.1 we discuss in more detail the existence of this local inverse function; it will be shown in Proposition 4.7 that "most tensors have a finite condition number." While the proof of Theorem 1.2 is not straightforward, the main intuition that led us to its conception is the observation that there appears to be a gap between the expected value of the condition number of TDPs in $R^{m_1\times m_2\times 2}$ and in the other spaces $R^{m_1\times m_2\times m_3}$, $m_1 \ge m_2 \ge m_3 \ge 3$, as we observed in [5]. Here, we derive a further characterization of the distribution of the condition number of random CPDs, based on a result of Cai, Fan, and Jiang [10] about the distribution of the minimum distance between random points on spheres.

Theorem 1.4. Let $a_1,\ldots,a_r \in R^{m_1}$ and $b_1,\ldots,b_r \in R^{m_2}$ be arbitrary and fixed, while we assume that $c_1,\ldots,c_r \in R^{m_3}$ are independent random vectors with standard normal entries. Consider the random rank-1 tensors $\mathcal{A}_i = a_i \otimes b_i \otimes c_i \in R^{m_1\times m_2\times m_3}$. Then, for any $\alpha > 0$ we have
$$ \lim_{r\to\infty} P\left[\kappa(\mathcal{A}_1,\ldots,\mathcal{A}_r) \ge \alpha\, r^{2/(m_3-1)}\right] \ \ge\ 1 - \exp\left(-K\,\alpha^{-(m_3-1)}\right), $$
where $K > 0$ is an explicit constant depending only on $m_3$ via the gamma function $\Gamma$.
In particular, if $m_3 = 2$ the lower bound reads $\lim_{r\to\infty} P\left[\kappa(\mathcal{A}_1,\ldots,\mathcal{A}_r) \ge \alpha r^2\right] \ge 1 - e^{-K/\alpha}$. This theorem suggests that as $m_3$ increases, very large condition numbers become increasingly unlikely. The worst case thus seems to occur for $m_3 = 2$, which is exactly the space from which PBAs try to recover the CPD. For example, if $m_3 = 2$ and $r$ is large, we can expect the condition number to be greater than $4r^2$ with probability at least (around) 5%.
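The effect of $m_3$ is easy to observe in a small Monte Carlo experiment. The sketch below estimates the tail of the angle-based lower bound $(1 - \max_{i\ne j}|\langle c_i/\|c_i\|, c_j/\|c_j\|\rangle|)^{-1/2}$ on the condition number, the quantity exploited in the proof in Section 3; the sample sizes and the threshold 100 are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def angle_lower_bound(r, m, trials=2000):
    """For each trial, sample r standard normal vectors in R^m, normalize them,
    and return the lower bound (1 - max_{i<j} |<c_i, c_j>|)^(-1/2) on the
    condition number of a CPD with these third-mode factors."""
    out = np.empty(trials)
    for t in range(trials):
        C = rng.standard_normal((m, r))
        C /= np.linalg.norm(C, axis=0)
        G = np.abs(C.T @ C)
        np.fill_diagonal(G, 0.0)
        out[t] = (1.0 - G.max()) ** -0.5
    return out

for m in (2, 5, 15):
    b = angle_lower_bound(15, m)
    print(f"m3 = {m:2d}: median bound = {np.median(b):9.2f}, "
          f"P[bound > 100] ~ {(b > 100).mean():.3f}")
```

For $m_3 = 2$ the bound is routinely in the hundreds, while for $m_3 = 15$ it stays close to 1, mirroring the gap predicted by Theorem 1.4.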
Outline. The next section recalls some preliminary material. As Theorem 1.4 provides the main intuition for the main result, we will treat it first in Section 3. Before proving Theorem 1.2, we need a precise definition of a PBA. This definition relies on the notion of r-nice tensors that we study in Section 4; these rank-r tensors have convenient differential-geometric properties. Then, in Section 5 we define the class of PBAs. Section 6 is dedicated to the proof of Theorem 1.2. Numerical experiments validating the theory and illustrating typical behavior for random CPDs are presented in Section 7. Finally, Section 8 presents our main conclusions.
Notation. The following notational conventions are observed throughout this paper: scalars are typeset in lower-case letters (a), vectors in bold-face lower-case letters (a), matrices in upper-case letters (A), tensors in a calligraphic font (A), and varieties and manifolds in an alternative calligraphic font (A). The unit sphere in $R^m$ is denoted by $S(R^m) := \{x \in R^m \mid \|x\| = 1\}$. The symmetric group of permutations on $r$ elements is denoted by $S_r$. $P_\pi$ denotes the $r \times r$ permutation matrix representing the permutation $\pi \in S_r$. The standard Euclidean inner product on $R^m$ is $\langle x, y\rangle := x^T y$ for $x, y \in R^m$.
Acknowledgements. We thank Vanni Noferini and Leonardo Robol for interesting discussions on the definition of numerical instability.

Preliminaries
Some elementary definitions from multilinear algebra and differential geometry are recalled.
2.1. Multilinear algebra. The tensor product of vector spaces $V_1, \ldots, V_d$ is denoted by $\otimes$; see [21, Chapter 1]. As the tensor product is unique up to isomorphism, we will be particularly liberal in switching between the interpretations $R^{n_1}\otimes\cdots\otimes R^{n_d} \simeq R^{n_1\times\cdots\times n_d} \simeq R^{n_1\cdots n_d}$. Elements of the first space are abstract order-$d$ tensors, in the second space they are $d$-arrays, while in the last space they are long vectors. We do not use a "vectorization" operator to indicate the natural bijection between the last two spaces.
The tensor product of linear maps is also well defined [21, Chapter 1]. We use this definition in expressions $M_1 \otimes \cdots \otimes M_d$, where $M_k = [m^k_i]_i \in R^{m_k\times n_k}$; the columns of this matrix are the vectors $m^1_{i_1} \otimes \cdots \otimes m^d_{i_d}$, and their order will not be relevant wherever it is used. The multilinear multiplication of a tensor $\mathcal{A} = \sum_{i_1,\ldots,i_d} a_{i_1,\ldots,i_d}\, e^1_{i_1}\otimes\cdots\otimes e^d_{i_d} \in R^{n_1\times\cdots\times n_d}$ with the above matrices $M_k$ is
$$ (M_1, M_2, \ldots, M_d)\cdot\mathcal{A} := \sum_{i_1,\ldots,i_d} a_{i_1,\ldots,i_d}\, (M_1 e^1_{i_1}) \otimes \cdots \otimes (M_d e^d_{i_d}). $$
This also entails the following well-known formula for the inner product between rank-1 tensors:
(2.1) $\langle u_1\otimes\cdots\otimes u_d,\ v_1\otimes\cdots\otimes v_d\rangle = \langle u_1, v_1\rangle\cdots\langle u_d, v_d\rangle$;
see, e.g., [22, Section 4.5]. The Khatri-Rao product of the matrices $M = [m_i]_{i=1}^r \in R^{m\times r}$ and $N = [n_i]_{i=1}^r \in R^{n\times r}$ is $M \odot N := [m_i \otimes n_i]_{i=1}^r \in R^{mn\times r}$. Note that it is a subset of the columns of the tensor product $M \otimes N$.
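These computation rules are easy to check numerically. A minimal numpy sketch (our own illustration; all dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
u = [rng.standard_normal(n) for n in (3, 4, 5)]
v = [rng.standard_normal(n) for n in (3, 4, 5)]

# Rank-1 tensors as d-fold outer products.
U = np.einsum('i,j,k->ijk', *u)
V = np.einsum('i,j,k->ijk', *v)

# (2.1): the inner product of rank-1 tensors factors mode by mode.
lhs = np.sum(U * V)
rhs = np.prod([ui @ vi for ui, vi in zip(u, v)])
assert np.isclose(lhs, rhs)

# The Khatri-Rao product M ⊙ N: column-wise Kronecker products.
M, N = rng.standard_normal((3, 2)), rng.standard_normal((4, 2))
KR = np.stack([np.kron(M[:, i], N[:, i]) for i in range(2)], axis=1)

# Each column of M ⊙ N is one of the columns of the tensor product M ⊗ N.
full = np.kron(M, N)                            # 12 x 4
assert np.allclose(KR[:, 0], full[:, 0])
assert np.allclose(KR[:, 1], full[:, 3])
```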
2.2. Differential geometry. The following elementary definitions are presented here only for submanifolds of Euclidean spaces; see, e.g., [32] for the general definitions. By a smooth ($C^\infty$) manifold we mean a topological manifold with a smooth structure, in the sense of [32]. The tangent space at $x$ to an $n$-dimensional smooth submanifold $\mathcal{M}\subset R^N$ can be defined as
$$ T_x\mathcal{M} := \{\gamma'(0) \mid \gamma : (-1, 1) \to \mathcal{M} \text{ is a smooth curve with } \gamma(0) = x\}. $$
It is a vector subspace whose dimension coincides with the dimension of $\mathcal{M}$. Moreover, at every point $x\in\mathcal{M}$, there exist an open neighborhood $V\subset\mathcal{M}$ of $x$, an open subset $U\subset T_x\mathcal{M}$, and a bijective smooth map $\varphi : V \to U$ with smooth inverse. The tuple $(V, \varphi)$ is a coordinate chart of $\mathcal{M}$. A smooth map between manifolds $F : \mathcal{M}\to\mathcal{N}$ is a map such that for every $x\in\mathcal{M}$ and coordinate chart $(V,\varphi)$ containing $x$, and every coordinate chart $(W,\psi)$ containing $F(x)$, we have that $\psi\circ F\circ\varphi^{-1}$ is a smooth map. The derivative of $F$ at $x$ can be defined as the linear map $d_x F : T_x\mathcal{M}\to T_{F(x)}\mathcal{N}$, $\gamma'(0)\mapsto (F\circ\gamma)'(0)$. A Riemannian manifold $(\mathcal{M}, g)$ is a smooth manifold $\mathcal{M}$ equipped with a Riemannian metric $g$, which is an inner product $g_x(\cdot,\cdot)$ on the tangent space $T_x\mathcal{M}$ that varies smoothly with $x\in\mathcal{M}$. If $\mathcal{M}\subset R^m$, then the Riemannian metric inherited from $R^m$ is $g_x(u, v) = \langle u, v\rangle$ for every $x\in\mathcal{M}$ and $u, v \in T_x\mathcal{M}$. The length of a smooth curve $\gamma : [0,1]\to\mathcal{M}$ is defined by $L(\gamma) := \int_0^1 \sqrt{g_{\gamma(t)}(\gamma'(t), \gamma'(t))}\,\mathrm{d}t$, and the distance $\mathrm{dist}_{\mathcal{M}}(x, y)$ between two points $x, y\in\mathcal{M}$ is the infimum of the lengths of the curves with extremes $x$ and $y$.
In Section 1, we denoted the Segre manifold of rank-1 tensors in $R^{n_1\times\cdots\times n_d}$ by $\mathcal{S}$. To emphasize the format, we sometimes write $\mathcal{S}_{n_1,\ldots,n_d}$ instead. Section 1 also defined the addition map
(2.2) $\Phi_r : \mathcal{S}^{\times r} \to R^{n_1\times\cdots\times n_d},\quad (\mathcal{A}_1,\ldots,\mathcal{A}_r)\mapsto\mathcal{A}_1+\cdots+\mathcal{A}_r.$
Tensors of rank (at most) $r$ are denoted by
(2.3) $\sigma_r := \Phi_r(\mathcal{S}^{\times r}).$
It is a semi-algebraic set by the Tarski-Seidenberg principle [3], because it is the projection of an algebraic variety, namely the graph of $\Phi_r$ [32]. Recall that this means that $\sigma_r$ can be described as the locus of points that satisfy a system of polynomial equations and inequalities; see [3]. The dimension of $\sigma_r$ equals the dimension of the smallest $R$-variety $\overline{\sigma_r}$ containing it [3, Chapter 2].

2.3. Numerical analysis. For a smooth map $f : \mathcal{M}\to\mathcal{N}$ between Riemannian manifolds $(\mathcal{M}, g)$ and $(\mathcal{N}, h)$ there is a standard definition of the condition number [2,9,36], which generalizes the classic case of smooth maps between Euclidean spaces, namely
(2.4) $\displaystyle\kappa[f](x) := \max_{t\in T_x\mathcal{M},\ t\ne 0} \frac{\|d_x f(t)\|_{\mathcal{N},f(x)}}{\|t\|_{\mathcal{M},x}},$
where $d_x f : T_x\mathcal{M}\to T_{f(x)}\mathcal{N}$ is the derivative of $f$, and $\|t_x\|_{\mathcal{M},x} := \sqrt{g_x(t_x, t_x)}$ (resp. $\|t_y\|_{\mathcal{N},y} := \sqrt{h_y(t_y, t_y)}$ for $t_y\in T_y\mathcal{N}$) is the norm on the tangent space $T_x\mathcal{M}$ (resp. $T_y\mathcal{N}$) induced by the Riemannian metric $g$ (resp. $h$).

Estimating the distribution of the condition number
We start by proving the second main result, Theorem 1.4, because little technical machinery is required. In the proof, we use the following identification of the condition number with the inverse of the smallest singular value of an auxiliary matrix: for $1 \le i \le r$ let $U_i$ be a matrix whose columns form an orthonormal basis of $T_{\mathcal{A}_i}\mathcal{S}$. Then, by [4, Theorem 1.1],
(3.1) $\displaystyle\kappa(\mathcal{A}_1,\ldots,\mathcal{A}_r) = \frac{1}{\varsigma_{\min}([U_1\ U_2\ \cdots\ U_r])},$
where $\varsigma_{\min}$ denotes the smallest singular value. The smallest singular value $\varsigma_{\min}(d_{(\mathcal{A}_1,\ldots,\mathcal{A}_r)}\Phi_r)$ is actually equal to the $r(n_1+\cdots+n_d-d+1)$th largest singular value of the Jacobian matrix of $\Phi_r$ seen as a $C^\infty$ map from $R^{r n_1\cdots n_d}$ to $R^{n_1\cdots n_d}$. Moreover, from (3.1) it follows that the condition number is scale invariant: for all $t_1,\ldots,t_r \in R\setminus\{0\}$ we have $\kappa(t_1\mathcal{A}_1,\ldots,t_r\mathcal{A}_r) = \kappa(\mathcal{A}_1,\ldots,\mathcal{A}_r)$. Cai, Fan, and Jiang [10] proved tail probabilities for the minimal pairwise angle of an independent sample of uniformly distributed points on the sphere. The idea for using their results in the proof of Theorem 1.4 is to lower bound the condition number in terms of such a minimal angle. This we do next.
Lemma 3.1. Let $\mathcal{A}_i = a_i\otimes b_i\otimes c_i \in R^{m_1\times m_2\times m_3}$, $i = 1,\ldots,r$, be rank-1 tensors, and let $\widehat{c}_i := c_i/\|c_i\|$. Then,
$$ \kappa(\mathcal{A}_1,\ldots,\mathcal{A}_r) \ \ge\ \Bigl(1 - \max_{i\ne j}\, |\langle\widehat{c}_i, \widehat{c}_j\rangle|\Bigr)^{-1/2}. $$

Proof. Without restriction we can assume that the maximum is attained for $i = 1$ and $j = 2$. By (3.1), the condition number is the inverse of the least singular value of the matrix $T := [U_1\ U_2\ \cdots\ U_r]$, where $U_i$ is any orthonormal basis for $T_{\mathcal{A}_i}\mathcal{S}$. In particular, orthonormal bases containing a prescribed unit tangent vector as a column can be chosen for $T_{\mathcal{A}_1}\mathcal{S}$ and $T_{\mathcal{A}_2}\mathcal{S}$ (see, e.g., [4, Section 5.1]): we choose $U_1$ so that it contains the tangent vector $\widehat{a}_2\otimes\widehat{b}_1\otimes\widehat{c}_1$ and $U_2$ so that it contains the tangent vector $\widehat{a}_2\otimes\widehat{b}_1\otimes\widehat{c}_2$ as columns, where $\widehat{a}_2$ and $\widehat{b}_1$ are likewise the unit-norm multiples of $a_2$ and $b_1$. Then, using the computation rules for inner products from (2.1), we find that the least singular value of $T$ is smaller than
$$ \tfrac{1}{\sqrt{2}}\,\bigl\|\widehat{a}_2\otimes\widehat{b}_1\otimes\widehat{c}_1 - \widehat{a}_2\otimes\widehat{b}_1\otimes\widehat{c}_2\bigr\| = \sqrt{1 - \langle\widehat{c}_1, \widehat{c}_2\rangle}. $$
Repeating the argument for the tangent vector $-\widehat{a}_2\otimes\widehat{b}_1\otimes\widehat{c}_2$ in $U_2$ we get $\varsigma_{\min}(T) \le \sqrt{1 - |\langle\widehat{c}_1, \widehat{c}_2\rangle|}$, concluding the proof. □

In the first set of experiments, visualized in Figure 3.1(A), we generated $10^5$ random rank-15 tensors of size $15\times 15\times n$ by randomly sampling factor matrices $A \in R^{15\times 15}$, $B \in R^{15\times 15}$, and $C \in R^{n\times 15}$ with standard normal entries. It is observed that the limiting distribution of Theorem 1.4 seems to approximate the shape of the distribution of the condition numbers reasonably well. However, the lower bound seems rather weak for $n = 2$. One of the main observations, which is also evident from the formula of the limiting distribution, is that as $n$ increases, the probability of sampling tensors with a high condition number decreases. As is evident from the empirical ccdf in Figure 3.1(A), $n = 2$ admits the worst distribution by far: there is a 10% probability of sampling a condition number greater than $10^5$, and still a 0.1% chance of encountering a condition number greater than $10^8$. On the other hand, for $n = 15$, all sampled tensors had a condition number less than 10.
In the second set of experiments, shown in Figure 3.1(B), we generated $10^5$ random rank-15 tensors of size $15\times 15\times n$ in a different way in order to illustrate the quality of the lower bound in Theorem 1.4. This time, after sampling the factor matrices $(A, B, C)$ as above, we perform a Gram-Schmidt orthogonalization of $A$ and $B$. As can be seen in Figure 3.1(B), the lower bound is considerably tighter in this setting.

Figure 3.1: Empirical distributions of the condition number. The tensors $\mathcal{A} = \sum_{i=1}^{15} a_i\otimes b_i\otimes c_i$ were generated by randomly sampling factor matrices $A \in R^{15\times 15}$, $B \in R^{15\times 15}$ and $C \in R^{n\times 15}$, as indicated.
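The characterization (3.1) that underlies these experiments can be evaluated directly. The sketch below assembles orthonormal tangent-space bases in the spirit of [4, Section 5.1] (the specific basis construction is our own illustrative choice) and computes $\kappa$ as the inverse of the smallest singular value; it also shows $\kappa$ blowing up as two of the $c_i$ become parallel, in line with the lower bound above:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def tangent_basis(a, b, c):
    """Orthonormal basis of the tangent space T_{a⊗b⊗c} S for unit vectors a, b, c:
    the column a⊗b⊗c together with p⊗b⊗c, a⊗q⊗c, and a⊗b⊗s, where p, q, s run
    through orthonormal bases of the orthogonal complements of a, b, c."""
    def complement(v):
        # Orthonormal basis of v^perp: QR of [v | I], dropping the first column.
        return np.linalg.qr(np.column_stack([v, np.eye(v.size)]))[0][:, 1:]
    cols = [np.einsum('i,j,k->ijk', a, b, c).ravel()]
    cols += [np.einsum('i,j,k->ijk', p, b, c).ravel() for p in complement(a).T]
    cols += [np.einsum('i,j,k->ijk', a, q, c).ravel() for q in complement(b).T]
    cols += [np.einsum('i,j,k->ijk', a, b, s).ravel() for s in complement(c).T]
    return np.column_stack(cols)

def cpd_condition(A, B, C):
    """kappa = 1 / sigma_min([U_1 ... U_r]) as in (3.1); by scale invariance the
    factor vectors may be normalized first."""
    U = np.hstack([tangent_basis(unit(A[:, i]), unit(B[:, i]), unit(C[:, i]))
                   for i in range(A.shape[1])])
    return 1.0 / np.linalg.svd(U, compute_uv=False)[-1]

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 2)), rng.standard_normal((5, 2))
c1, w = unit(rng.standard_normal(3)), rng.standard_normal(3)

kappas = []
for delta in (1.0, 1e-2, 1e-4):
    C = np.column_stack([c1, unit(c1 + delta * w)])
    kappas.append(cpd_condition(A, B, C))
    print(f"delta = {delta:g}: kappa = {kappas[-1]:.2e}")
```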
We had one additional reason to treat Theorem 1.4 first: on a fundamental level, a PBA solves the TDP for n 1 × n 2 × n 3 tensors by transforming it into a TDP for n 1 × n 2 × 2 tensors. The above experiments clearly show that the latter problem has a much worse distribution of condition numbers than the original problem. In other words, from the viewpoint of sensitivity, PBAs try to solve an easy problem via the solution of a significantly more difficult problem. This approach is nearly guaranteed to end in instability.

The manifold of r-nice tensors
While the instability of PBAs is already plausible from Figure 3.1, proving Theorem 1.2 is substantially more complicated. In order to prove it, we should first formalize what we mean by "solving a TDP." This problem is rife with subtleties.
For example, what should the solution of a TDP be if the input tensor $\mathcal{A}$ is the generic rank-11 tensor in $C^{11\times 6\times 3}$? This tensor has 352,716 isolated CPDs [24]. Computing all of them seems computationally infeasible. Nevertheless, all of them are well behaved, because each one of them varies smoothly in a small open neighborhood of $\mathcal{A}$ in $C^{11\times 6\times 3}$. On the other hand, the generic rank-6 tensor $\mathcal{B}$ of multilinear rank $(4, 4, 4)$ in $C^{6\times 6\times 6}$ behaves erratically. It has 2 isolated decompositions [11, Theorem 1.3], but a generic rank-6 tensor close to $\mathcal{B}$ has only one decomposition, which can be moved around continuously in such a way that its limit is a decomposition of $\mathcal{B}$. This process works for both of $\mathcal{B}$'s decompositions, because the rank-6 tensors have two smooth folds meeting in $\mathcal{B}$ [12, Example 4.2]. What should an algorithm compute in this case?
For an $r$-identifiable tensor $\mathcal{A}$ there is an unambiguous answer to the above question. Namely, the solution is the unique set of rank-1 tensors $\{\mathcal{A}_1,\ldots,\mathcal{A}_r\}$ whose sum is $\mathcal{A}$. The goal of this section is to carefully define a tensor decomposition map $\tau_{r;n_1,\ldots,n_d}$ in Definition 4.8 whose computation solves the TDP for a subset of rank-$r$ tensors. The domain where the smooth function $\tau_{r;n_1,\ldots,n_d}$ is well defined deserves its own definition, Definition 4.1 below; we call it the manifold of $r$-nice tensors $\mathcal{N}\subset\sigma_r$. In Proposition 4.7 we prove that $\mathcal{N}$ is a Zariski-open dense subset of the set of rank-$r$ tensors, so that "almost all tensors are $r$-nice." Before defining $\mathcal{N}$, we first need the following two standard definitions. If for a collection of $r$ vectors $p_1,\ldots,p_r \in R^n$ every subset of $\min\{r, n\}$ many vectors is linearly independent, then the vectors are said to lie in general linear position (GLP). We say that a collection of $r$ rank-1 tensors $\mathcal{A}_i = a^1_i\otimes\cdots\otimes a^d_i$ lies in strongly general linear position (SGLP) if, for every $k = 1,\ldots,d$, the vectors $a^k_1,\ldots,a^k_r$ lie in GLP; see [13]. Recall from (2.3) the definition of the rank-$r$ tensors $\sigma_r$; its closure is denoted by $\overline{\sigma_r}$.

Definition 4.1. $\mathcal{M}_{r;n_1,\ldots,n_d} \subset \mathcal{S}^{\times r}_{n_1,\ldots,n_d}$ is defined to be the set containing all the rank-1 tuples $a = (\mathcal{A}_1,\ldots,\mathcal{A}_r)$ satisfying the following properties:
(i) $\Phi_r(a)$ is a smooth point of $\overline{\sigma_r}$,
(ii) $\Phi_r(a)$ is $r$-identifiable and, thus, has rank equal to $r$,
(iii) $a$ has a finite condition number,
(iv) $a$ is in SGLP, and
(v) for all $i$ the $(1, 1, \ldots, 1)$-entry of $\mathcal{A}_i$ is not equal to zero.
The $r$-nice tensors $\mathcal{N}_{r;n_1,\ldots,n_d}$ are defined to be the image of $\mathcal{M}_{r;n_1,\ldots,n_d}$ under the addition map $\Phi_r$ from (2.2): $\mathcal{N}_{r;n_1,\ldots,n_d} := \Phi_r(\mathcal{M}_{r;n_1,\ldots,n_d})$. If it is clear from the context, we drop the subscript from both $\mathcal{M}_{r;n_1,\ldots,n_d}$ and $\mathcal{N}_{r;n_1,\ldots,n_d}$ and simply write $\mathcal{M}$ and $\mathcal{N}$.
Remark 4.2. The reason for the last requirement, (v), is that under this restriction we can define a parametrization of rank-1 tensors that is a diffeomorphism; see the next subsection for details.

4.1. Elementary results. Before proceeding, we need a few elementary results related to the differential geometry of CPDs, which we did not find in the literature. The rank-1 tensors, together with the zero tensor, form the affine cone over a smooth projective variety, namely the Segre variety (see, e.g., [31]); hence, $\mathcal{S}$ is an analytic submanifold of $R^{n_1\times\cdots\times n_d}$. Its dimension is $1 + \sum_{k=1}^d (n_k - 1)$ [31]. The map
$$ \Psi_{n_1,\ldots,n_d} : (R\setminus\{0\})\times S(R^{n_1})\times\cdots\times S(R^{n_d}) \to \mathcal{S},\quad (\lambda, u^1,\ldots,u^d)\mapsto \lambda\, u^1\otimes\cdots\otimes u^d $$
is a surjective local diffeomorphism; the fiber over $\Psi_{n_1,\ldots,n_d}(\lambda, u^1,\ldots,u^d)$ is $\{(\lambda s_1\cdots s_d,\ s_1 u^1,\ldots,s_d u^d) \mid s_1,\ldots,s_d\in\{-1, 1\}\}$, which has $2^d$ elements. Moreover, $\Psi_{n_1,\ldots,n_d}$ is a proper map, so that it is a $2^d$-sheeted smooth covering map [32, p. 91-95].
Let $S_+(R^n) := \{u \in S(R^n) \mid u_1 > 0\}$ be the "upper" half of the unit sphere; it is a submanifold in the subspace topology on $R^n$. Let us define the following restriction of $\Psi$:
(4.2) $\Psi^*_{n_1,\ldots,n_d} : (R\setminus\{0\})\times S_+(R^{n_1})\times\cdots\times S_+(R^{n_d}) \to \mathcal{S}.$
It follows from the foregoing that $\Psi^*_{n_1,\ldots,n_d}$ is a bijective local diffeomorphism onto its image, so it is a (global) diffeomorphism onto its image. Let $\mathcal{S}^*_{n_1,\ldots,n_d}$ be the image of $\Psi^*_{n_1,\ldots,n_d}$. When it is clear from the context, we drop the subscripts from $\Psi_{n_1,\ldots,n_d}$, $\Psi^*_{n_1,\ldots,n_d}$, and $\mathcal{S}^*_{n_1,\ldots,n_d}$. The foregoing explains part (v) in Definition 4.1: we wish to work with a parametrization of $\mathcal{S}$ that is a diffeomorphism, so we restrict ourselves to $\mathcal{S}^*$ and use $\Psi^*$. We will show in the proof of Proposition 4.5 that $\mathcal{S}^*$ is open in the Zariski topology and, hence, open and dense in the Euclidean topology.
Finally, we consider the subset $S_{r;n} \subset (S_+(R^n))^{\times r}$ defined as
(4.3) $S_{r;n} := \{(s_1,\ldots,s_r) \in (S_+(R^n))^{\times r} \mid [s_1, \ldots, s_r] \in R^{n\times r} \text{ has full rank}\}.$
Note that $S_{r;n}$ is an open submanifold, because the locus of points not satisfying the rank condition is closed in the Zariski topology. We also have the following result.

Lemma 4.3. Let $S_r$ be the symmetric group on $r$ elements. Then, $\overline{S}_{r;n} := S_{r;n}/S_r$ is a manifold. Moreover, the projection $\pi : S_{r;n} \to \overline{S}_{r;n}$, $(x_1,\ldots,x_r)\mapsto\{x_1,\ldots,x_r\}$, is a local diffeomorphism.
Proof. $S_r$ is a discrete Lie group acting smoothly, freely, and properly on $S_{r;n}$ by permuting the entries of the tuples; hence, the quotient is a smooth manifold and the projection $\pi$ is a local diffeomorphism [32, Chapter 21]. □

Generic $r$-identifiability of tensors has been studied intensively; see [12,13] for the state of the art. In the context of PBAs, the following standard result suffices.

Proposition 4.4. Let $n_1 \ge n_2 \ge r \ge 2$ and $n_3 \ge 2$. Then, generic $r$-identifiability holds for the rank-$r$ tensors in $R^{n_1\times n_2\times n_3}$.
Proof. This follows, for example, from the effectiveness of Kruskal's criterion; see [13]. □
Next, we prove an important property of the set M r;n1,...,n d from Definition 4.1.
Proposition 4.5. Assume that generic $r$-identifiability holds in $R^{n_1\times\cdots\times n_d}$. Then, $\mathcal{M}_{r;n_1,\ldots,n_d}$ is a nonempty Zariski-open, and hence open and dense, subset of $\mathcal{S}^{\times r}$.

Proof. It suffices to show that the set of rank-1 tuples failing one of the conditions (i)-(v) of Definition 4.1 is contained in a proper Zariski-closed subset of $\mathcal{S}^{\times r}$; this would then prove the assertion.
Recall that generic $r$-identifiability implies nondefectivity of $\overline{\sigma_r}$; see [31, Chapter 5, specifically Corollary 5.3.1.3]. Hence, $\dim\sigma_r = \dim\overline{\sigma_r} = \dim\mathcal{S}^{\times r} = r\dim\mathcal{S}$. The subvariety $\Sigma\subset\overline{\sigma_r}$ of singular points is proper and closed in the Zariski topology by definition [23]. This means that, in addition to the polynomials that vanish on the $R$-variety $\overline{\sigma_r}$, there are $k \ge 1$ additional nontrivial polynomial equations with coefficients over $R$ such that $f_1(y) = \cdots = f_k(y) = 0$ for all $y\in\Sigma$. If $y$ has a preimage $x\in\mathcal{S}^{\times r}$ under $\Phi_r$, then $f_1(\Phi_r(x)) = \cdots = f_k(\Phi_r(x)) = 0$. Hence, the locus $B_{(i)}$ of decompositions not satisfying condition (i) in Definition 4.1, i.e., those that map into the singular locus $\Sigma$ under $\Phi_r$, is a Zariski-closed set. It is also a proper subset, because otherwise $\Phi_r(\mathcal{S}^{\times r}) = \sigma_r \subset \Sigma$, which is a contradiction as $\dim\Sigma < \dim\overline{\sigma_r} = \dim\sigma_r$.
The set of tensors in σ r with several decompositions is closed in the Zariski topology by assumption. We can apply the same argument as in the previous paragraph to conclude that the variety of decompositions B (ii) ⊂ S ×r that map to points of σ r that are not r-identifiable is a proper Zariski closed subset in S ×r .
The subset $B_{(iii)} \subset \mathcal{S}^{\times r}$ of decompositions with condition number $\infty$ is contained in a Zariski-closed set if the $r$-secant variety $\overline{\sigma_r}$ is nondefective, by [6, Lemma 5.3].
The set of points $B_{(iv)} \subset \mathcal{S}^{\times r}$ not satisfying (iv) is Zariski-closed by [13, Lemma 4.4]. For the last point, observe that condition (v) of Definition 4.1 is equivalent to $a \in (\mathcal{S}^*)^{\times r}$. By the definition of $\mathcal{S}^*$ in (4.2), the set of points in $\mathcal{S}\setminus\mathcal{S}^*$ is the intersection of $\mathcal{S}$ with the union of the following linear varieties:
$$ V_k := R^{n_1}\otimes\cdots\otimes R^{n_{k-1}}\otimes\mathrm{span}\{e_2,\ldots,e_{n_k}\}\otimes R^{n_{k+1}}\otimes\cdots\otimes R^{n_d},\qquad k = 1,\ldots,d, $$
where $e_i$ is the $i$th standard basis vector of $R^{n_k}$. In fact, $\mathcal{S}\cap V_k$ is essentially a copy of $\mathcal{S}_{n_1,\ldots,n_{k-1},n_k-1,n_{k+1},\ldots,n_d}$, which is thus a Zariski-closed set of strictly smaller dimension, because $\dim\mathcal{S}_{n_1,\ldots,n_{k-1},n_k-1,n_{k+1},\ldots,n_d} < \dim\mathcal{S}$. Therefore, taking $B := B_{(i)}\cup B_{(ii)}\cup B_{(iii)}\cup B_{(iv)}\cup B_{(v)}$ with $B_{(v)} := \mathcal{S}^{\times r}\setminus(\mathcal{S}^*)^{\times r}$, we conclude that $\mathcal{M}_{r;n_1,\ldots,n_d} = \mathcal{S}^{\times r}\setminus B$ is the complement of a proper Zariski-closed subset, which concludes the proof. □

The definition of $\mathcal{M}_{r;n_1,\ldots,n_d}$ is nice, because the addition map $\Phi_r$ from (2.2) restricted to $\mathcal{M}_{r;n_1,\ldots,n_d}$ is a local diffeomorphism onto its image. However, we wish to work with global diffeomorphisms and therefore need the following proposition.

Proposition 4.6. $\overline{\mathcal{M}}_{r;n_1,\ldots,n_d} := \mathcal{M}_{r;n_1,\ldots,n_d}/S_r$ is a smooth manifold, and the projection $\pi : \mathcal{M}_{r;n_1,\ldots,n_d} \to \overline{\mathcal{M}}_{r;n_1,\ldots,n_d}$, $(\mathcal{A}_1,\ldots,\mathcal{A}_r)\mapsto\{\mathcal{A}_1,\ldots,\mathcal{A}_r\}$, is a local diffeomorphism.

Proof. Combine the proof of Lemma 4.3 with the fact that $r$-identifiability implies that the rank-1 tensors in a decomposition $(\mathcal{A}_1,\ldots,\mathcal{A}_r) \in \mathcal{M}_{r;n_1,\ldots,n_d}$ are pairwise distinct. □
It is clear that the addition map $\Phi_r$ is constant on $S_r$-orbits in $\mathcal{M}_{r;n_1,\ldots,n_d}$. Therefore, the induced map $\overline{\Phi}_r$ is well defined on $\overline{\mathcal{M}}_{r;n_1,\ldots,n_d}$. Now, we have the following crucial result.

Proposition 4.7. Assume that generic $r$-identifiability holds. Then, $\mathcal{N}_{r;n_1,\ldots,n_d}$ is a smoothly embedded submanifold of $R^{n_1\times\cdots\times n_d}$ that is dense in $\sigma_r$, and $\overline{\Phi}_r : \overline{\mathcal{M}}_{r;n_1,\ldots,n_d} \to \mathcal{N}_{r;n_1,\ldots,n_d}$ is a diffeomorphism.

Proof. As before, for brevity, we drop all subscripts. Let $a = (\mathcal{A}_1,\ldots,\mathcal{A}_r) \in \mathcal{M}$. By definition, $a$ has a finite condition number. This means, by [4, Theorem 1.1], that the derivative of $\Phi_r$ at $a$ is injective. Hence, $\Phi_r$, and thus also $\overline{\Phi}_r$, is a smooth immersion [32, p. 78]. By the generic $r$-identifiability assumption, it follows that the $r$-secant variety $\overline{\sigma_r}$ is not defective, so that $\dim\overline{\sigma_r} = r\dim\mathcal{S}$. Moreover, by Proposition 4.5, we have $r\dim\mathcal{S} = \dim\mathcal{M}_{r;n_1,\ldots,n_d}$ and, by construction, we have $\dim\mathcal{M}_{r;n_1,\ldots,n_d} = \dim\overline{\mathcal{M}}_{r;n_1,\ldots,n_d}$. As $\overline{\Phi}_r$ is injective by generic $r$-identifiability and by having taken the particular quotient in Proposition 4.6, [32, Proposition 4.22(d)] entails that $\overline{\Phi}_r$ is a smooth embedding. The first conclusion follows by [32, Proposition 5.2].
The foregoing already shows that $\mathcal{N}_{r;n_1,\ldots,n_d} \subset \sigma_r$ is open. We show that it is dense. Let $\mathcal{A} \in \sigma_r\setminus\mathcal{N}_{r;n_1,\ldots,n_d}$ with decomposition $\mathcal{A} = \Phi_r(a) = \mathcal{A}_1+\cdots+\mathcal{A}_r$. By Proposition 4.5, there exists a sequence $(a^{(k)})_{k\in\mathbb{N}} \subset \mathcal{M}_{r;n_1,\ldots,n_d}$ with $a^{(k)} \to a$ as $k\to\infty$. Note that this is convergence in the usual Euclidean topology that $\mathcal{M}_{r;n_1,\ldots,n_d}$ inherits from the ambient space $(R^{n_1\times\cdots\times n_d})^{\times r}$. Consequently, the components also converge individually: $\mathcal{A}^{(k)}_i \to \mathcal{A}_i$ for all $i = 1,\ldots,r$. The result follows from the fact that adding the above convergent sequences results in a convergent sequence in $\mathcal{N}_{r;n_1,\ldots,n_d}$ with limit $\mathcal{A}$. Hence, $\mathcal{A}$ lies in the closure of $\mathcal{N}_{r;n_1,\ldots,n_d}$, so that $\mathcal{N}_{r;n_1,\ldots,n_d}$ is dense in $\sigma_r$, concluding the proof. □
From Proposition 4.7, $\overline{\Phi}_r$ has a smooth inverse, which solves the TDP on $\mathcal{N}_{r;n_1,\ldots,n_d} \subset R^{n_1\times\cdots\times n_d}$. We finally arrive at the goal of this section.

Definition 4.8. The map
$$ \tau_{r;n_1,\ldots,n_d} := \overline{\Phi}_r^{-1} : \mathcal{N}_{r;n_1,\ldots,n_d} \to \overline{\mathcal{M}}_{r;n_1,\ldots,n_d},\quad \mathcal{A}_1+\cdots+\mathcal{A}_r \mapsto \{\mathcal{A}_1,\ldots,\mathcal{A}_r\} $$
is well defined on the $r$-nice tensors.
We call this mapping the tensor decomposition map.

Remark 4.9. One way to interpret the construction in this section is that near $\mathcal{A} \in \mathcal{N}_{r;n_1,\ldots,n_d}$ we locally have the identification $\tau_{r;n_1,\ldots,n_d} = \pi\circ\Phi_a^{-1}$, where $a = (\mathcal{A}_1,\ldots,\mathcal{A}_r)$ is any ordered $r$-nice decomposition of $\mathcal{A}$, $\Phi_a^{-1}$ is the local inverse in (1.2), and $\pi$ is as in Proposition 4.6.

4.3. Implications for the condition number. Let $a = (\mathcal{A}_1,\ldots,\mathcal{A}_r)$ be any ordered $r$-nice decomposition in $\mathcal{M}_{r;n_1,\ldots,n_d}$. For the $r$-nice tensor $\mathcal{A} = \mathcal{A}_1+\cdots+\mathcal{A}_r \in \mathcal{N}_{r;n_1,\ldots,n_d}$, we will relate the condition number $\kappa[\tau_{r;n_1,\ldots,n_d}](\mathcal{A})$, as in (2.4), to the condition number of the CPD $\kappa(\mathcal{A}_1,\ldots,\mathcal{A}_r)$ from [4]. We have the following result.

Lemma 4.10. With the notation as above:
(1) $\pi$ is a local isometry;
(2) for all $\mathcal{A} = \mathcal{A}_1+\cdots+\mathcal{A}_r \in \mathcal{N}_{r;n_1,\ldots,n_d}$ we have $\kappa[\tau_{r;n_1,\ldots,n_d}](\mathcal{A}) = \kappa(\mathcal{A}_1,\ldots,\mathcal{A}_r)$; and
(3) for all $a, b \in \mathcal{M}_{r;n_1,\ldots,n_d}$ we have $\mathrm{dist}_{\overline{\mathcal{M}}}(\pi(a), \pi(b)) = \min_{\pi'\in S_r} \mathrm{dist}_{\mathcal{M}}(a, \pi'(b))$, where $\pi'(b)$ denotes the tuple $b$ with its entries permuted by $\pi'$.
Here, $\mathrm{dist}_{\mathcal{M}}$ and $\mathrm{dist}_{\overline{\mathcal{M}}}$ are the respective Riemannian distances.
Because of the equality of condition numbers in Lemma 4.10 and (1.2), we find that for every $a = (\mathcal{A}_1,\ldots,\mathcal{A}_r) \in \mathcal{M}_{r;n_1,\ldots,n_d}$ we have
$$ \kappa(\mathcal{A}_1,\ldots,\mathcal{A}_r) = \kappa[\tau](\mathcal{A}) = \lim_{\epsilon\to0}\ \sup_{\mathcal{B}\in\mathcal{N},\,\|\mathcal{A}-\mathcal{B}\|_F\le\epsilon}\ \frac{\mathrm{dist}_{\overline{\mathcal{M}}}(\tau(\mathcal{A}), \tau(\mathcal{B}))}{\|\mathcal{A}-\mathcal{B}\|_F} = \lim_{\epsilon\to0}\ \sup_{\mathcal{B}\in\mathcal{N},\,\|\mathcal{A}-\mathcal{B}\|_F\le\epsilon}\ \min_{\pi'\in S_r}\frac{\mathrm{dist}_{\mathcal{M}}(a, \pi'(b))}{\|\mathcal{A}-\mathcal{B}\|_F}, $$
where $\tau = \tau_{r;n_1,\ldots,n_d}$, $b$ is an ordered decomposition of $\mathcal{B}$, and the last equality follows from (3) in Lemma 4.10. This equality is very significant because it allows us to make sense of the distance between two unordered CPDs, i.e., sets of rank-1 tensors, $\{\mathcal{A}_1,\ldots,\mathcal{A}_r\}$ and $\{\mathcal{B}_1,\ldots,\mathcal{B}_r\}$. As a consequence, we get an instance of the well-known rule of thumb in numerical analysis:
$$ \min_{\pi'\in S_r}\ \|A - B P_{\pi'}\|_F \ \mathrel{\widetilde{\le}}\ \kappa(\mathcal{A}_1,\ldots,\mathcal{A}_r)\,\|\mathcal{A}-\mathcal{B}\|_F, $$
where $A$ (resp. $B$) is a matrix that contains the vectorized rank-1 tensors $\mathcal{A}_i$ (resp. $\mathcal{B}_i$) as columns, and $P_{\pi'}$ is the $r\times r$ permutation matrix representing the permutation $\pi'$. The notation $\widetilde{\le}$ indicates that the bound is asymptotically sharp for infinitesimal $\|\mathcal{A}-\mathcal{B}\|_F$.
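Evaluating the left-hand side of this rule of thumb does not require enumerating all $r!$ permutations: matching the columns of $A$ and $B$ is a linear assignment problem on the $r \times r$ matrix of pairwise squared distances, solvable in $O(r^3)$ time. A short sketch (our own illustration) using scipy's assignment solver:

```python
import numpy as np
from itertools import permutations
from scipy.optimize import linear_sum_assignment

def cpd_distance(A, B):
    """min over permutations pi of ||A - B P_pi||_F, where the columns of A and B
    are vectorized rank-1 tensors; reduces to a linear assignment problem on the
    r x r matrix of pairwise squared column distances."""
    D = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    rows, cols = linear_sum_assignment(D)
    return np.sqrt(D[rows, cols].sum())

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 4))
B = A[:, [2, 0, 3, 1]] + 1e-6 * rng.standard_normal((10, 4))  # permuted + noise

# The assignment solution matches the brute-force minimum over all 4! permutations.
brute = min(np.linalg.norm(A - B[:, p]) for p in permutations(range(4)))
assert np.isclose(cpd_distance(A, B), brute)
```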

Pencil-based algorithms for the CPD
We start by specifying a very general class of numerical algorithms to which the analysis in Section 6 applies. The construction may seem a bit abstract at first sight, so it is useful to keep in mind that the prototypical algorithm from the introduction is an example of a PBA.
As it suffices, in principle, to present a single input for which an algorithm is unstable, we can choose a well-behaved subset of r-nice tensors N * ⊂ N r;n1,n2,n3 ⊂ R n1×n2×n3 (for the exact choice of N * see Definition 5.1 below) and specify what a PBA should compute for such inputs. If the numerical instability already occurs on this subset, then it is also unstable on larger domains. We recall from Section 4 that by considering only r-nice tensors N r;n1,n2,n3 , the TDP consists of computing the action of the function τ r;n1,n2,n3 from Definition 4.8. PBAs compute this map in a particular way, via the four transformations described below.
The input of a PBA is assumed to be the multidimensional array $\mathcal{A} \in R^{n_1\times n_2\times n_3}$. The first transformation is the multilinear multiplication $\rho_Q$ that maps $n_1\times n_2\times n_3$ tensors to the format $n_1\times n_2\times 2$ via a matrix $Q \in R^{n_3\times 2}$ with orthonormal columns:
$$ \rho_Q : R^{n_1\times n_2\times n_3} \to R^{n_1\times n_2\times 2},\quad \mathcal{A}\mapsto (I_{n_1}, I_{n_2}, Q^T)\cdot\mathcal{A}. $$
The second transformation, $\theta$, computes the set of unit-norm columns of the first factor matrix $A$ of the CPD when restricted to $\mathcal{N}_{r;n_1,n_2,2}$:
$$ \theta|_{\mathcal{N}_{r;n_1,n_2,2}} : \mathcal{N}_{r;n_1,n_2,2} \to \overline{S}_{r;n_1},\quad \sum_{i=1}^r a_i\otimes b_i\otimes c_i \mapsto \{\widehat{a}_1,\ldots,\widehat{a}_r\}, $$
where $\widehat{a}_i \in S_+(R^{n_1})$ is the unit-norm scalar multiple of $a_i$ contained in the upper half-sphere. Herein, $\overline{S}_{r;n_1} = S_{r;n_1}/S_r$, where $S_{r;n_1}$ is as in (4.3). Note the curious definition of $\theta$ involving the restriction to $\mathcal{N}_{r;n_1,n_2,2}$. The reason for this formulation is that a PBA will be executed using floating-point arithmetic. It is unlikely that the floating-point representation $\mathrm{fl}(\mathcal{B}) \in R^{n_1\times n_2\times 2}$ is exactly in $\mathcal{N}_{r;n_1,n_2,2} \subset R^{n_1\times n_2\times 2}$, even when $\mathcal{B} \in \mathcal{N}_{r;n_1,n_2,2}$. Therefore, a minimal additional demand is placed on $\theta$: for every $\mathcal{B} \in \mathcal{N}_{r;n_1,n_2,2}$, $\theta$ must be defined for $\mathrm{fl}(\mathcal{B})$.
The third transformation, $\upsilon$, when restricted to
$$ R_{r;n_1,n_2,n_3} := \left\{ (\mathcal{A}, \bar{A}) \;\middle|\; \bar{A} = \{\widehat{a}_1,\ldots,\widehat{a}_r\} \in \overline{S}_{r;n_1} \text{ and } \mathcal{A} = \sum_{i=1}^r \widehat{a}_i\otimes b_i\otimes c_i \in \mathcal{N}_{r;n_1,n_2,n_3} \right\}, $$
essentially computes the Khatri-Rao product $B \odot C$ of the remaining factor matrices, namely
$$ \upsilon|_{R_{r;n_1,n_2,n_3}} : (\mathcal{A}, \bar{A}) \mapsto \bigl([\widehat{a}_1\ \cdots\ \widehat{a}_r]^{\dagger}\,\mathcal{A}_{(1)}\bigr)^T = B\odot C, $$
where the columns of the output are ordered consistently with the chosen order of the elements of $\bar{A}$. For the proof of instability in Section 6, it will not matter if or how $\upsilon$ is defined outside of $R_{r;n_1,n_2,n_3}$, so we impose no further constraints. The final step computes the (unordered) Khatri-Rao product of two ordered sets of vectors:
$$ \odot : R^{p\times r}\times R^{q\times r} \to \mathcal{S}^{\times r}_{p,q}/S_r,\quad \bigl((x_1,\ldots,x_r), (y_1,\ldots,y_r)\bigr)\mapsto\{x_1\otimes y_1,\ldots,x_r\otimes y_r\}. $$
Applied to $\bar{A}$ and $B \odot C$, this yields the set of rank-1 tensors solving the TDP. We will define a PBA to be an algorithm composing the above functions. The input space for a PBA is thus
$$ \mathcal{N}^* := \rho_Q^{-1}(\mathcal{N}_{r;n_1,n_2,2}) \cap \mathcal{N}_{r;n_1,n_2,n_3} $$
(this is the subset $\mathcal{N}^*$ mentioned at the start of this section). Hence, we arrive at the definition of the class of PBAs for solving a TDP whose input is in $\mathcal{N}^* \subset R^{n_1\times n_2\times n_3}$.
Definition 5.1 (Pencil-based algorithm). A pencil-based algorithm for solving the TDP is an algorithm that computes the tensor decomposition map τ_{r;n1,n2,n3}, when given an n1 × n2 × n3 input array A ∈ N^*, by performing the following four steps:
S1. compute the projected tensor B = ρ_Q(A) ∈ R^{n1×n2×2};
S2. compute the set of unit-norm columns of the first factor matrix, A = θ(B);
S3. compute the Khatri-Rao product of the remaining factor matrices, B ⊙ C = υ(A, A);
S4. return the Khatri-Rao product of A and B ⊙ C as the set of rank-1 tensors.

Pencil-based algorithms are unstable
We continue by showing that PBAs are numerically forward unstable for solving the TDP for third-order tensors. For A ∈ N^* ⊂ R^{n1×n2×n3}, let {Â_1, ..., Â_r} be the CPD returned by a PBA in floating-point arithmetic. The overall goal in the proof of Theorem 1.2 is to show that for every small ǫ > 0 there exists an open neighborhood O_ǫ ⊂ N_{r;n1,n2,n3} of r-nice tensors such that for A = A_1 + · · · + A_r in that neighborhood the excess factor is at least a constant times ǫ^{-1}. The exact statement is Theorem 6.1 below. We call ω the excess factor because it measures by how much the forward error produced by the numerical algorithm, as measured by the numerator, exceeds the forward error one can expect from solving the TDP (which is equivalent to computing the map τ_{r;n1,n2,n3}), as measured by the denominator. Showing that the excess factor can become arbitrarily large on the domain of τ_{r;n1,n2,n3} is essentially equivalent to the standard definition of numerical forward instability of an algorithm for computing τ_{r;n1,n2,n3} [25]. In fact, the excess factor can be interpreted as a quantitative measure of the forward numerical instability of an algorithm on a particular input. Ideally, ω is bounded by a small constant, but for numerically unstable algorithms ω is "too large" relative to the problem dimensions. The next result is a more precise version of Theorem 1.2; it states that for all A ∈ O_ǫ, a PBA becomes arbitrarily unstable as ǫ → 0, irrespective of the problem size.
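In the notation above, the excess factor of (6.1) has the following shape (a hedged reconstruction from the description of numerator and denominator; the exact norms in (6.1) may differ):

```latex
% numerator: forward error actually produced by the algorithm;
% denominator: forward error expected from the conditioning of the TDP
\omega(\mathcal{A}) \;:=\;
\frac{\operatorname{err}_{\mathrm{forward}}(\mathcal{A})}
     {\kappa[\tau_{r;n_1,n_2,n_3}](\mathcal{A})\,
      \|\widehat{\mathcal{A}}-\mathcal{A}\|_F},
\qquad \widehat{\mathcal{A}} := \mathrm{fl}(\mathcal{A}).
```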

6.1. The key ingredients.
The key observation is that, for computing the tensor decomposition map τ_{r;n1,n2,n3}, every PBA computes θ in step S2. We will show that the condition number of θ is comparable to the condition number of τ_{r;n1,n2,2}. Combining this result with the observations from Section 3 and [6], which both demonstrated that the condition number of the tensor decomposition map τ_{r;n1,n2,2} for n1 × n2 × 2 tensors can be much worse than that of τ_{r;n1,n2,n3} for n1 × n2 × n3 tensors, motivated our proof of Theorems 1.2 and 6.1.
Let us consider the relation between the tensor decomposition map for n 1 × n 2 × 2-tensors and θ. For brevity, we denote the manifold of r-nice tensors in R n1×n2×2 by N := N r;n1,n2,2 .
The main intuition underpinning the proof of Theorem 6.1 is the commutative diagram relating Id_N × θ, η, and τ_{r;n1,n2,2}. Herein, η is any map such that τ_{r;n1,n2,2} = η ∘ (Id_N × θ). For example, we could take the map η = τ_{r;n1,n2,2} ∘ π_1, where π_1(x, y) = x projects onto the first factor. To clearly convey the main idea, let us imagine for a moment that Id_N × θ, η, and τ_{r;n1,n2,2} were smooth (C^∞) multivariate functions between Euclidean spaces. For any such function f, the condition number at x is the spectral norm of the Jacobian matrix J_f(x) of f at x; see, e.g., [9, Proposition 14.1]. Consequently, for a composite function g ∘ f, the chain rule J_{g∘f}(x) = J_g(f(x)) J_f(x) yields κ[g ∘ f](x) ≤ κ[g](f(x)) · κ[f](x). It thus seems feasible to obtain lower bounds on the condition number of f = Id_N × θ as a function of the condition numbers of g ∘ f = τ_{r;n1,n2,2} and g = η.
The key insight is that η should be chosen in such a way that it has a condition number bounded by a constant, so that κ[Id N × θ](B) would be comparable in magnitude to κ[τ r;n1,n2,2 ](B).
Using the above ideas, we rigorously prove the next lemma in the appendix; it states that the condition number of θ can be bounded from below by the condition number of the tensor decomposition map τ_{r;n1,n2,2} in some cases. This shows that in some circumstances the condition number of θ|_N is proportional to the condition number of τ_{r;n1,n2,2} in R^{n1×n2×2}. Unfortunately, the errors in the computation of θ|_N cannot always be corrected, as we prove in the following result in the appendix. Lemma 6.3. Let ν > 0 be sufficiently small. Let A = Σ_{i=1}^r a_i ⊗ b_i ⊗ c_i ∈ N^* with ||a_i|| = 1 and factor matrices A, B, C. Let Â ∈ R^{n1×r} be a fixed matrix with unit-norm columns and let δ := min_{π∈S_r} ||Â − A P_π||_F. If ||b_i ⊗ c_i|| ≥ 1 − ν for i = 1, ..., r, δ < 1, and there exists a matrix A′ ∈ R^{n1×r} with orthonormal columns such that ||Â − A′||_F ≤ ν, then every B̂ ∈ R^{n2×r} and Ĉ ∈ R^{n3×r} satisfy the lower bound on the forward error proved in the appendix. This result implies that, even if steps S3 and S4 of a PBA could perfectly recover the rank-1 terms, the PBA would not be able to compensate for the error introduced in the computation of θ in step S2. Moreover, under the assumptions of Lemma 6.2, the condition number of θ is proportional to the condition number of τ_{r;n1,n2,2}. This indicates that the magnification of an input perturbation by a PBA will be roughly proportional to the condition number of the TDP for n1 × n2 × 2 tensors. However, we recall from Section 3 and [5] that there is a great discrepancy between the distributions of the condition numbers of the TDPs for n1 × n2 × n3 and n1 × n2 × 2 tensors, the latter being much larger than the former on average. This will then imply that the excess factor ω in (6.1) is large. In the next subsections, we exploit Lemmata 6.2 and 6.3 to show that ω is actually unbounded.

6.2. Constructing a bad tensor. The role of O in Theorem 6.1 will be played by the following tensor.
Let A′ ∈ R^{n1×r} and B′ ∈ R^{n2×r} be matrices with orthonormal columns. Let U be the n3 × n3 matrix U = [Q⊥ Q], where Q⊥ is an n3 × (n3 − 2) matrix whose columns form an orthonormal basis of the orthogonal complement of the column span of Q. Define the matrix C′ with r columns in terms of U, the all-ones vectors 1_k ∈ R^k, and I_{m×n} = [e_i]_{i=1}^n, where e_i is the ith standard basis vector of R^m. By construction, C′ has orthonormal columns. The orthogonally decomposable (odeco) tensor associated with these factor matrices is (6.4) O = Σ_{i=1}^r a′_i ⊗ b′_i ⊗ c′_i. It will satisfy the requirements in Theorem 6.1 and complete the proof of instability of PBAs.
It is a very bad omen that O is not a valid input for PBAs: the projected tensor ρ_Q(O) has a positive-dimensional family of decompositions, which implies κ[τ_{r;n1,n2,2}] = ∞. The next result allows us to apply Lemmata 6.2 and 6.3 to tensors in O_ǫ. Lemma 6.5. Let ǫ > 0 be sufficiently small, and let A′, B′, C′ be as in the definition of O in (6.4). Then there exists a constant S > 0 such that for all (A_1, ..., A_r) ∈ O_ǫ with factor matrices A, B, C, where both A and B have unit-norm columns, the following bounds hold: ||A − A′||_F ≤ Sǫ and ||Z − Q^T C′||_F ≤ Sǫ, where Z := Q^T C. This lemma is proved in the appendix. Combining Lemmata 6.4 and 6.5 with Lemmata 6.2 and 6.3, we obtain the following important corollary. Corollary 6.6 (An r-nice bad tensor). Let ǫ > 0 be sufficiently small.
Let A = A_1 + · · · + A_r be an element of O_ǫ such that the factor matrices A ∈ R^{n1×r} and B ∈ R^{n2×r} have unit-norm columns. Then, (1) A ∈ N^*, i.e., A is r-nice and its projection B = ρ_Q(A) is also r-nice; (2) there exists an A′ ∈ R^{n1×r} with orthonormal columns such that ||A − A′||_F ≤ Sǫ; and (3) B ⊙ C ∈ R^{n2n3×r} has columns whose norms satisfy 1 − Sǫ ≤ ||b_i ⊗ c_i|| ≤ 1 + Sǫ.
6.3. Proof of Theorem 6.1. Let A ∈ O_ǫ be as in Corollary 6.6. Its floating-point representation is Â := fl(A). We show that the excess factor ω(A) from (6.1) is proportional to ǫ^{-1}. We assume that the output of step S1 is the best possible numerical result when providing Â as input, namely the floating-point representation B̂ := fl(ρ_Q(A)). To streamline the analysis, we ignore further compounding of roundoff errors, assuming the best possible case in which the PBA manages to execute steps S2, S3 and S4 exactly (perhaps by invoking an oracle). Let {â_1, ..., â_r} = θ(B̂) and Â := [â_i]_i; then the same construction as in Section 4.3 bounds the error of step S2 from below. In fact, a small component in the direction of the worst perturbation is expected: from the concentration-of-measure phenomenon, assuming that the perturbation B̂ − B is random with no preferred direction, it follows with high probability that the component of the perturbation in the worst direction is of size comparable to (r dim S_{n1,n2,2})^{-1/2} = (r(n1 + n2))^{-1/2}. See Armentano's work [1] for an analysis of the impact of this consideration on the observed value of the so-called stochastic condition number. It follows that there exists a number 1 ≥ β_1 > 0 such that the error of step S2 is at least β_1 times the product of the condition number and the size of the perturbation; the last inequality in that bound holds by the definition of condition numbers and restrictions of maps. Applying Lemma 6.2 and using the properties from Corollary 6.6 yields a lower bound in terms of Z := Q^T C. Assume that the left-hand side is bounded from above by 1. Regardless of the particular {b̂_i ⊗ ĉ_i}_i that the PBA computes in step S3, invoking Lemma 6.3 shows that after completion of step S4 the forward error is bounded from below by the error committed in step S2. Dividing both sides of this expression by κ[τ_{r;n1,n2,n3}](A) · ||Â − A||_F gives the excess factor ω(A). We continue by bounding the factor ||B̂ − B||_F ||Â − A||_F^{-1} in this inequality. Since B̂ = fl(B) and Â = fl(A), we have, in the standard model of floating-point arithmetic, b̂_{i1,i2,i3} = b_{i1,i2,i3}(1 + δ_{i1,i2,i3}) with |δ_{i1,i2,i3}| ≤ ǫ_u, and â_{i1,i2,i3} = a_{i1,i2,i3}(1 + δ′_{i1,i2,i3}) with |δ′_{i1,i2,i3}| ≤ ǫ_u.
There exists a β_2 ≥ 0 such that ||B̂ − B||_F = β_2 ǫ_u ||B||_F. While a detailed analysis of the value of β_2 is outside the scope of this work, it is reasonable to assume that β_2 is not too small, say β_2 ≥ 10^{-1}; if all entries are of roughly the same magnitude, i.e., b²_{i1,i2,i3} ≈ ||B||²_F /(n1 n2 n3), then modeling the relative rounding errors as uniform suggests β_2 of order 3^{-1/2}. Hence, we need bounds on the norms of A and B. To this end, the following well-known result is useful.
where e_i is the ith standard basis vector of R^r. The norms of A and B are then estimated using this result. Exploiting the linearity of the multilinear multiplication ρ_Q, and using the 3-flattening of the tensor, we obtain corresponding bounds for B. Finally, we bound κ[τ_{r;n1,n2,2}](B). Let d_s = r dim S_{n1,n2,2}, and recall that z_i := Q^T c_i. Recall the matrix T from the proof of Lemma 3.1, now applied to the CPD of B. Consider the submatrix of T corresponding to the first two rank-1 terms, and let v′ be the stacked vector built from z_1 ⊗ a_2 and z_2 ⊗ b_1.
Note that ||v′||² = ||z_1||² + ||z_2||². From the identification of condition numbers in Lemma 4.10 and from the steps in the proof of Lemma 3.1, a lower bound on κ[τ_{r;n1,n2,2}](B) in terms of ||v′|| follows. We already showed above that ||Z − Q^T C′||_F ≤ Sǫ. Consequently, assuming that ǫ is sufficiently small, we obtain ||v′|| ≥ √(4/n3) − Sǫ. Plugging all of these into (6.8) yields (6.9). Plugging (6.7) and (6.9) into (6.6) concludes the proof of Theorem 6.1. This ultimately completes the proof of Theorem 1.2. Remark 6.7. It is important to observe that the construction of the open set O_ǫ depends on the projection operator ρ_Q and, hence, on Q ∈ R^{n3×2}. That is, we have shown that for any choice of Q that is independent of A, there exists an open set on which the PBA with projection ρ_Q is unstable. The construction does not automatically apply to situations where Q is chosen as a function of the input A.

Numerical experiments
We present the results of some numerical experiments in Matlab R2017b to support the main result and to exemplify the behavior of PBAs on third-order random CPDs. They were performed on a computer system consisting of two Intel Xeon E5-2697 CPUs (12 cores, 2.6GHz each) with 128GB of main memory.
Three PBAs are considered in the experiments below, which we refer to as cpd_pba, cpd_pba2 and cpd_gevd, respectively. The first, cpd_pba, is an ordinary implementation of the prototypical PBA discussed in Section 1, using ST-HOSVD [44] as orthogonal Tucker compression. cpd_pba2 computes the CPD by randomly projecting the input tensor A with ρ_Q, then employing the cpd function from Tensorlab v3.0 to recover the two factor matrices A and B, and finally recovering the third factor matrix from the 3-flattening A_(3) by solving a linear least-squares problem, so as to obtain a representative of the set of rank-1 tensors. The last PBA we consider is the cpd_gevd function from Tensorlab v3.0. The analysis in Section 6 does not strictly apply to the default settings of cpd_gevd, because it chooses the projection matrix Q as a function of the input tensor A ∈ R^{n1×n2×n3}. Specifically, if (U_1, U_2, U_3) · S = A is the HOSVD [16], then cpd_gevd chooses Q as the first two columns of U_3.
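For concreteness, the prototypical PBA can be sketched in a few lines of NumPy. This is our own simplified rendition under the assumptions r ≤ min(n1, n2) and a generic rank-r input; it is not the Tensorlab implementation, and all helper names are ours:

```python
import numpy as np

def cpd_pba_sketch(A, r, seed=0):
    """Simplified pencil-based CPD (illustrative sketch).
    S1: project the third mode onto a random 2-dimensional subspace;
    S2: recover the first factor from an eigendecomposition of the pencil;
    S3/S4: recover the remaining factors by linear least squares."""
    rng = np.random.default_rng(seed)
    n1, n2, n3 = A.shape
    Q, _ = np.linalg.qr(rng.standard_normal((n3, 2)))  # projection rho_Q
    B = np.einsum('ijk,kl->ijl', A, Q)                 # n1 x n2 x 2 pencil
    B1, B2 = B[:, :, 0], B[:, :, 1]
    # For A = sum_i a_i (x) b_i (x) c_i, B1 @ pinv(B2) = A1 diag(.) pinv(A1),
    # so its dominant eigenvectors recover the first factor up to scale.
    w, V = np.linalg.eig(B1 @ np.linalg.pinv(B2))
    F1 = np.real(V[:, np.argsort(-np.abs(w))[:r]])
    # 1-flattening satisfies A_(1) = F1 @ khatri_rao(F2, F3)^T; solve for it.
    M = np.linalg.pinv(F1) @ A.reshape(n1, n2 * n3)
    terms = []
    for i in range(r):
        # each row of M reshapes to the rank-1 matrix b_i c_i^T; split by SVD
        U, s, Vt = np.linalg.svd(M[i].reshape(n2, n3))
        terms.append(np.einsum('i,j,k->ijk', F1[:, i], s[0] * U[:, 0], Vt[0]))
    return terms
```

On a generic random rank-r input the returned rank-1 terms sum back to A; the point of this section is that the forward error of the individual terms can nevertheless be far larger than the conditioning of the original TDP warrants.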
Throughout these experiments, the forward error of the TDP is evaluated as follows.
Given the set of true rank-1 tensors {A_1, ..., A_r} and a computed set {A′_1, ..., A′_r}, the quantity min_{π∈S_r} (Σ_{i=1}^r ||A_i − A′_{π(i)}||²_F)^{1/2} is the forward error. Evaluating all r! permutations is a Herculean task when r ≫ 10. Fortunately, when A and A′ are very close, the optimal permutation can be found heuristically by solving the linear least-squares problem min_{X∈R^{r×r}} ||A ⊙ B ⊙ C − (A′ ⊙ B′ ⊙ C′)X||_F and then projecting the minimizer to the set of permutation matrices by setting the largest value in every row to 1 and the rest to zero. In all experiments, the forward error is computed in this manner.

7.1. The bad odeco tensor. We start with an experiment to support the analysis of Section 6. Let ρ_Q = Id ⊗ Id ⊗ Q^T, where Q ∈ R^{n3×2}, be the projection operator of the PBA. Let A ∈ R^{n1×n2×n3} be an r-nice tensor whose CPD is ǫ-close to the odeco tensor (6.4), i.e., A ∈ O_ǫ, where the latter is as in Lemma 6.4. According to the analysis in Section 6, the excess factor ω(A) of a PBA with projection operator ρ_Q should behave like O(ǫ^{-1}).
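The heuristic permutation matching can be sketched as follows (hypothetical helper; K and Kp stand for the matrices A ⊙ B ⊙ C and A′ ⊙ B′ ⊙ C′):

```python
import numpy as np

def match_permutation(K, Kp):
    """Heuristic from the text: solve min_X ||K - Kp X||_F by linear least
    squares and round the minimiser to a permutation matrix by setting the
    largest value in every row to 1 and the rest to zero."""
    X, *_ = np.linalg.lstsq(Kp, K, rcond=None)
    P = np.zeros_like(X)
    P[np.arange(X.shape[0]), np.argmax(X, axis=1)] = 1.0
    return P
```

When the two decompositions are close, the minimizer X is close to the true permutation matrix and the rounding recovers it exactly.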
We consider 89 × 29 × 11 tensors. Q ∈ R^{11×2}, A′ ∈ R^{89×10} and B′ ∈ R^{29×10} were each generated by computing the Q-factor of the QR-decomposition of a matrix with i.i.d. standard normal entries. The matrix C′ ∈ R^{11×10} was constructed as in the definition of (6.4). Then the rank-1 terms O_i := a′_i ⊗ b′_i ⊗ c′_i were formed for i = 1, ..., 10. For k = 1, ..., 50, we constructed the randomly perturbed tensors P_{k,i} = O_i + 2^{-k} X_{k,i}/||X_{k,i}||_F, where X_{k,i} has i.i.d. standard normal entries. Using the cpd function with default settings from Tensorlab, we then computed the rank-1 approximations A_{k,i} of P_{k,i}. Let ǫ_k := max{||A_{k,i} − O_i||_F}_i; the corresponding tensor is A_k = Σ_{i=1}^{10} A_{k,i}, so that A_k ∈ O_{ǫ_k} with probability 1. Let a*_k = {A_{k,1}, ..., A_{k,10}} denote the true CPD. A rank-10 CPD a_k ∈ S^{×r}/S_r of A_k was computed numerically using cpd_pba and the forward error relative to a*_k was computed. We also applied cpd with default settings to A_k to numerically compute another rank-10 CPD a′_k. The forward error between a*_k and a′_k was recorded. The results of the above experiment are shown in Figure 7.1. cpd attains a forward error of approximately 4 · 10^{-16} in all cases. As the random tensors are very close to the odeco tensor, their condition numbers are approximately 1. A forward error equal to a small multiple of the machine precision 1.11 · 10^{-16} is thus anticipated from a stable algorithm. The situation is dramatically different for cpd_pba. Since the odeco tensor was chosen to behave badly with respect to the projection ρ_Q, we expect from Section 6 that the forward error of the PBA grows like the excess factor ω = O(ǫ^{-1}). The dashed line in Figure 7.1 shows the result of fitting the model kǫ^{-1} to the data with ǫ > 10^{-14}. As can be seen, the experimental data match the predictions from the theory in Section 6 very well, specifically with regard to the growth rate of the excess factor.

7.2. Distribution of the excess factors.
The previous experiment illustrated the forward error in the worst case that we know of, mainly to illustrate Theorem 1.2. Based on the construction in Section 6, it is not reasonable to expect that this corresponds to typical behavior. However, the next experiment shows that, unfortunately, one should typically expect a loss of precision of at least a few digits.
The setup is as follows. For each tested tensor shape n1 × n2 × n3, we generated 10^5 random rank-n2 CPDs {a_1 ⊗ b_1 ⊗ c_1, ..., a_r ⊗ b_r ⊗ c_r} by sampling the entries of the vectors a_i ∈ R^{n1}, b_i ∈ R^{n2} and c_i ∈ R^{n3} i.i.d. from a standard normal distribution. The corresponding tensor A = A_1 + · · · + A_r was then formed. Recall that cpd by default will use the PBA cpd_gevd as initialization and will then refine its output by running a quasi-Newton method; see [41,45]. The stopping criterion for cpd was a tolerance proportional to ǫ_u ≈ 1.1 · 10^{-16}, the unit roundoff of standard double-precision floating-point arithmetic. The forward error of cpd will thus be bounded approximately by 2√10 κ(A_1, ..., A_r) · ǫ_u. Recalling the shape of the ccdfs of the condition number from Figure 3.1, we again note that as n3 increases, the likelihood of large condition numbers diminishes. In fact, most of the generated TDPs were well-conditioned, as can be inferred from the figure by noting that the forward error of cpd is always less than 10^{-11}.
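One trial of this sampling scheme can be sketched as follows (an illustrative NumPy helper; the actual experiments were run in Matlab):

```python
import numpy as np

def random_cpd(n1, n2, n3, r, rng=None):
    """Sample a random rank-r CPD as in the experiment: factor vectors with
    i.i.d. standard normal entries. Returns the list of rank-1 terms and the
    corresponding tensor A = A_1 + ... + A_r."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((n1, r))
    B = rng.standard_normal((n2, r))
    C = rng.standard_normal((n3, r))
    terms = [np.einsum('i,j,k->ijk', A[:, i], B[:, i], C[:, i])
             for i in range(r)]
    return terms, sum(terms)
```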
The loss of precision of the two PBAs is very pronounced in Figure 7.2. Although cpd_gevd is not strictly a PBA, because its projection operator depends on the tensor, its loss of precision in Figure 7.2 asymptotically matches that of the PBAs. Note the seemingly asymptotic log-linear relationship between the probability P[ω > x] and x in the right plots in Figure 7.2; that is, it seems plausible that asymptotically P[ω > x] = ax^{-1} for some a > 0. A possible explanation of this behavior follows from our geometrical interpretation of the causes of instability. The inputs A for which we expect ω(A) > x with large x are those such that c_i and c_j are well separated and yet Q^T c_i ≈ Q^T c_j for some i ≠ j. This is more likely to happen if n3 is large, since c_i ∈ S(R^{n3}) while Q^T c_i ∈ S(R^2). Indeed, the extreme case Q^T c_i = Q^T c_j, for some i ≠ j, corresponds to a hypersurface L of S(R^{n3})^{×r}. If we interpret Q^T c_i ≈ Q^T c_j as being close to L, then we expect ω > x to occur in a neighborhood of radius comparable to 1/x around L. This neighborhood has a volume of order x^{-1}, qualitatively explaining the observed behavior.

Conclusions
We proved in Theorem 1.2 that popular pencil-based algorithms for computing the CPD of low-rank third-order tensors are numerically unstable. Moreover, not only do there exist inputs for which such algorithms are unstable, the numerical experiments suggest that for certain random CPDs the loss of precision is roughly O(−log10(ǫ)) with probability ǫ. In addition to these results, we bounded the distribution of condition numbers of random CPDs in Theorem 1.4.

Figure 7.2. The empirical cumulative distribution function of the forward error err_forward and the multiplication factor µ for the standard PBA from Section 1, the cpd_gevd and cpd functions from Tensorlab, and the cpd function from Tensorlab initialized with the factor matrices obtained with the PBA, applied to rank-n2 tensors of size n1 × n2 × n3.
The main conclusion of our work is this: PBAs should be handled with care, as the numerical experiments in Section 7 demonstrated that an excess loss of precision is probable. When the most accurate result is sought, we advise applying a Newton-based refinement procedure to the output of a PBA. This is in fact the default strategy pursued by the cpd function from Tensorlab v3.0. While this strategy is certainly advisable when the input is perturbed only by roundoff errors, it is not clear to us whether employing a PBA to generate a starting point for an iterative method is more effective than a random initialization in the presence of significant (measurement) errors in the input data, both for reasons of conditioning (Theorem 1.4) and stability (Theorem 1.2). We believe that a further study of this point is required.
We hope that the construction of inputs for which PBAs are unstable, in Section 6, offers insights that can help in the design of numerically stable algorithms for computing CPDs. Our analysis suggests that methods partly recovering the rank-1 tensors from a matrix pencil are numerically unstable in the neighborhood of some adversarially chosen inputs.
Finally, we emphasize that the numerical instability of PBAs is caused by transforming the tensor decomposition problem into a more ill-conditioned computational problem that is nevertheless perceived to be easier to solve, probably because direct algorithms exist for it. Here is thus a decidedly positive message that we wish to stress: computing a CPD can be easier, from a numerical point of view, than solving the generalized eigendecomposition problem for a projected tensor. We hope that these observations may (re)invigorate the search for numerically stable algorithms for computing CPDs.

For (1) we just refer to [35, Section 2.3], which covers our case since the group S_r acts by isometries on M. Therefore, the induced metric ḡ on M̄ is the pushforward ḡ := π_* g of the Riemannian metric g on M that is inherited from the standard product of inner products on the ambient Euclidean space (namely (R^{n1···nd})^{×r}) of M ⊂ S^{×r} ⊂ (R^{n1···nd})^{×r}. We denote by h the metric on N which is given by the standard Euclidean inner product ⟨·,·⟩ that N inherits from the ambient space R^{n1×···×nd} ≃ R^{n1···nd}.
It will be insightful to describe the metric g on M more concretely. Let a = (A 1 , . . . , A r ) ∈ M be an arbitrary ordered r-nice decomposition, and let a := π(a) denote the corresponding CPD. Let π −1 a be the smooth local section with ( π −1 a • π)(a) = a. The pushforward g = π * g is defined (see [32, p. 183]) as the map satisfying g a ( s, t) := g a (d π(a) π −1 a ( s), d π(a) π −1 a ( t )) for all Using the identification T a M ≃ T a M which is given by the isometry d a π we can denote t = {t 1 , . . . , t r } with t i ∈ T Ai S. Similarly, we can write s = {s 1 , . . . , s r } with s i ∈ T Ai S. Then, it follows that where t M, a is the induced norm on T a M. From the foregoing discussion it indeed follows for every choice of a ∈ π −1 (τ (A)) that where the second equality is by the definition of the metric, the third by the linearity of derivatives, and the final equality is precisely Theorem 1.1 of [4]. This finishes the proof of (2). Finally, (3) follows from the fact that π is a local isometry and thus preserves the lengths of curves. Given any curve joining two elements in M, its lift through the covering π thus has the same length. Since we are free to choose the representative, we thus choose one that minimizes the length of the lifted curve.
Consider again the diagram from (6.2). Note that N , M, and N × S are manifolds. We claim that Θ = Id N ×θ| N and η are smooth maps between manifolds. We can explicitly write η as where A = [a i ] i ∈ S is a n 1 × r matrix with the a i 's as columns in any order; B (1) = A(B ⊙ Z) T is the 1-flattening [31] of B = r i=1 a i ⊗ b i ⊗ z i ; and with a minor abuse of notation π is the smooth map that takes a matrix and sends it to the set of its columns. By assumption r ≤ n 1 so that S is the manifold of matrices with linearly independent unit-norm columns. Therefore, A † = (A T A) −1 A T for all A ∈ S, which is a smooth map. Consequently, η is a smooth map, by [32,Proposition 2.10 (d)]. Let Ψ * n1,...,n d be the map from (4.1). Then, we have θ| N = π • π 2 • (Ψ * n1,n2,2 ) −1 ×r • τ, where π 2 : R \ {0} × S + (R n1 ) × S + (R n2 ) × S + (R n3 ) → S + (R n1 ) projects onto the second factor. The projection π is a local diffeomorphism by Lemma 4.3, the coordinate projection π 2 is smooth, Ψ * n1,n2,2 is a diffeomorphism, and τ is a diffeomorphism by Proposition 4.7. Therefore, θ| N is smooth, by [32, Proposition 2.10(d)], and so Θ is smooth.
Recall that the spectral norm of a linear operator F : V → W, where V and W are normed vector spaces with respective norms ||·||_V and ||·||_W, is ||F||_{V,W} := max_{||t||_V = 1} ||F(t)||_W. For composable maps, the foregoing spectral norms are submultiplicative. Since τ = η ∘ Θ is a composition of smooth maps between manifolds, we have d_A τ = d_{Θ(A)} η ∘ d_A Θ. Therefore, the spectral norm of d_A τ is bounded by the product of the spectral norms of the two factors, where the last step follows from the definition in (2.4). Note that this generalizes (6.3).
We can write the condition number of Θ as a function of the condition number of θ| N . Indeed, let t ∈ T A N be arbitrary, and observe that Exploiting that √ 1 + x 2 ≤ 1 + |x| for all x ∈ R, we thus find The proof will be completed by bounding κ[ η](Θ(A)) from above. As Riemannian metric on N × S we choose the product metric of the natural Riemannian metric on N , which is inherited from the ambient R n1×n2×2 ≃ R n1n22 , and the Riemannian metric that is the pushforward of the standard Euclidean inner product that S inherits from R n1×r via the map π : S → S, which is also a local isometry by the same arguments as in the proof of Lemma 4.10. Let A = [a i ] i ∈ S be a factor matrix of B = r i=1 a i ⊗ b i ⊗ z i , which thus imposes an order on the a i 's. Let us denote the other two factor matrices by B = [b i ] i ∈ S r;n2 (the b i 's are in GLP) and Z = [z i ] i ∈ R 2×r . Since N × S is locally isometric to N × S, there is a local section π −1 A of π. As M is locally isometric to M via π, there is also a local section π −1 ⋆ that is consistent with A in the sense that The derivative of η is computed as follows. We note that whereȦ is a tangent vector in T A S r;n1 . We find that Let (Ḃ,Ȧ) be a maximizer of (A.2). Note that Ḃ F ≤ 1 and Ȧ F ≤ 1. Since A ⊙ (A † B (1) ) T is a submatrix of A ⊗ (A † B (1) ) T , it follows that A ⊙ (A † B (1) ) T F ≤ A F A † B (1) F . Exploiting this inequality and the triangle inequality a few times, we obtain The right-hand side is a Lipschitz continuous function in (B, A) ∈ R n1×n2×2 × R n1×r , say with Lipschitz constant ℓ > 0.
By assumption there is a matrix A ′ = [a ′ i ] i with orthonormal columns with A − A ′ F < ν. Let B ′ be the tensor with factor matrices A ′ ,B, Z; that is, Then, by the triangle inequality and the computation rules for inner products of rank-1 tensors from (2.1), where the last step is because b i ⊗ z i F < 1 + ν for each i. This shows that (B, A) − (B ′ , A ′ ) F ≤ r 2 ν 2 (1 + ν) 2 + ν 2 = ν r 2 (1 + ν) 2 + 1.
Assume that ν ≤ 1 and let us write L := ℓ √ 4r 2 + 1. Then, using the Lipschitz continuity from above, A ′ F = √ r and (A ′ ) † = (A ′ ) T we find Recall that for matrices X, Y we have the inequality XY F ≤ min{ X 2 Y F , X F Y 2 }. Observe that (A ′ ) T B ′ (1) = B ⊙ Z and (A ′ ) T 2 = 1. Exploiting these we obtain where in the last step we assumed that νL ≤ r. Plugging this into (A.1) finishes the proof.
A.3. Proof of Lemma 6.3. Observe that B̂ ⊙ Ĉ can naturally be regarded as a matrix in the space R^{n2n3×r}. Therefore, ε := min_{π∈S_r} ||A ⊙ B ⊙ C − (Â ⊙ B̂ ⊙ Ĉ)P_π||_F ≥ min_{π∈S_r} min_{M∈R^{n2n3×r}} ||A ⊙ B ⊙ C − (Â ⊙ M)P_π||_F, where P_π is the permutation matrix corresponding to π. Let π ∈ S_r be any permutation. Then the inner minimum may be taken over Â ⊙ M without the permutation, because of the definition of the Khatri-Rao product and because every M ∈ R^{n2n3×r} can be factored as (M P^{-1}_π)P_π, since P_π is invertible. Let m_1, ..., m_r be the columns of M. The objective is a sum of squares, so we can minimize over each m_i separately. The first-order necessary optimality conditions are (â_{πi} ⊗ I_{n2n3})^T (a_i ⊗ (b_i ⊗ c_i) − â_{πi} ⊗ m_i) = 0 for i = 1, ..., r.
Solving for m_i yields the unique solution m_i = ⟨â_{πi}, a_i⟩ b_i ⊗ c_i. Plugging this minimizer into the ith term of the right-hand side of (A.3) yields a lower bound on that term, where we used the computation rules for inner products from (2.1) in the first step and the assumption ||b_i ⊗ c_i||_F ≥ 1 − ν in the last step. From this, a lower bound on the minimum over M ∈ R^{n2n3×r} follows in terms of the inner products ⟨â_{π1}, a_1⟩, ..., ⟨â_{πr}, a_r⟩.