A nonuniform fast Fourier transform based on low rank approximation

By viewing the nonuniform discrete Fourier transform (NUDFT) as a perturbed version of a uniform discrete Fourier transform, we propose a fast, stable, and simple algorithm for computing the NUDFT that costs $\mathcal{O}(N\log N\log(1/\epsilon)/\log\!\log(1/\epsilon))$ operations based on the fast Fourier transform, where $N$ is the size of the transform and $0<\epsilon<1$ is a working precision. Our key observation is that a NUDFT and DFT matrix divided entry-by-entry is often well-approximated by a low rank matrix, allowing us to express a NUDFT matrix as a sum of diagonally-scaled DFT matrices. Our algorithm is simple to implement, automatically adapts to any working precision, and is competitive with state-of-the-art algorithms. In the fully uniform case, our algorithm is essentially the FFT. We also describe quasi-optimal algorithms for the inverse NUDFT and two-dimensional NUDFTs.

1. Introduction. The nonuniform discrete Fourier transform (NUDFT) is an important task in computational mathematics that appears in signal processing [4], the numerical solution of partial differential equations [20], and in magnetic resonance imaging [12]. Quasi-optimal algorithms for computing the NUDFT are referred to as nonuniform fast Fourier transforms (NUFFTs), and state-of-the-art NUFFTs are usually based on oversampling, discrete convolutions, and the fast Fourier transform (FFT) on an oversampled grid [10,15,23,28]. In this paper, we propose a NUFFT that is embarrassingly parallelizable. It is numerically stable without the need for oversampling, and costs K FFTs, where K is a carefully selected integer. Our central idea is to exploit a low rank observation (see (1.3)).
Given coefficients $c_0,\ldots,c_{N-1}\in\mathbb{C}$, samples $x_0,\ldots,x_{N-1}\in\mathbb{R}$, and frequencies $\omega_0,\ldots,\omega_{N-1}\in\mathbb{R}$, the NUDFT is the task of computing
$$f_j = \sum_{k=0}^{N-1} c_k e^{-2\pi i x_j \omega_k}, \qquad 0\le j\le N-1. \qquad (1.1)$$
Since (1.1) involves $N$ sums with each sum containing $N$ terms, computing the vector $f$ naively costs $\mathcal{O}(N^2)$ operations. If the samples are equispaced, i.e., $x_j = j/N$, and the frequencies are integers, i.e., $\omega_k = k$, then the transform is fully uniform and (1.1) can be computed by the FFT in $\mathcal{O}(N\log N)$ operations by exploiting algebraic redundancies [8]. Unfortunately, these algebraic redundancies are "brittle" [5] and the ideas behind the FFT are not immediately useful when either the samples are nonequispaced or the frequencies are noninteger. To develop a NUFFT, one has to exploit a nonzero working precision $0<\epsilon<1$ and make careful approximations. There are three types of NUDFTs [15]:
• NUDFT-I (uniform samples and noninteger frequencies): In (1.1) the samples are equispaced, i.e., $x_j = j/N$, and the frequencies $\omega_0,\ldots,\omega_{N-1}$ are noninteger. This corresponds to evaluating a generalized Fourier series at equispaced points.
• NUDFT-II (nonuniform samples and integer frequencies): In (1.1) the frequencies are integers, i.e., $\omega_k = k$, and the samples $x_0,\ldots,x_{N-1}$ are nonequispaced. This corresponds to evaluating a Fourier series at nonequispaced points, i.e., computing the matrix-vector product $\tilde{F}_2 c$ with
$$(\tilde{F}_2)_{jk} = e^{-2\pi i x_j k}, \qquad 0\le j,k\le N-1, \qquad (1.2)$$
where $x_0,\ldots,x_{N-1}$ are sample points.
• NUDFT-III (nonuniform samples and noninteger frequencies): In (1.1) both the samples and the frequencies are nonuniform.
Therefore, a NUFFT-II is simply a quasi-optimal complexity algorithm for computing the matrix-vector product $\tilde{F}_2 c$. In the fully uniform case when $x_j = j/N$ and $\omega_k = k$, we use the notation $F_{jk} = e^{-2\pi i jk/N}$ for the DFT matrix and note that the FFT algorithm computes $Fc$ in $\mathcal{O}(N\log N)$ operations [8].
Our NUFFT-II algorithm is based on the simple observation that if the samples are near-equispaced, then $\tilde{F}_2 \oslash F$ can be well-approximated by a low rank matrix.¹ That is, for a small integer $K$ (see Table 2.1), we find that
$$\tilde{F}_2 \oslash F \approx \sum_{r=0}^{K-1} u_r v_r^\top, \qquad (1.3)$$
where '$\oslash$' denotes the Hadamard division, i.e., $C = A\oslash B$ means that $C_{jk} = A_{jk}/B_{jk}$. With (1.3) in hand, we have
$$\tilde{F}_2 = (\tilde{F}_2 \oslash F)\circ F \approx \sum_{r=0}^{K-1} (u_r v_r^\top)\circ F = \sum_{r=0}^{K-1} D_{u_r} F D_{v_r}, \qquad (1.4)$$
where '$\circ$' is the Hadamard product and $D_u$ is the diagonal matrix with the entries of $u$ on the diagonal. Therefore, the NUFFT-II can be computed in $\mathcal{O}(KN\log N)$ operations via $K$ diagonally-scaled FFTs. The approximation in (1.4) is the main idea in this paper. All that remains is to select the integer $K$ and compute the vectors
¹ A similar observation was made in [19, Sec. 4], but we believe that it has not been developed into a practical algorithm. A different Hadamard product matrix decomposition was exploited in [26] to derive a fast Chebyshev-to-Legendre transform.
$u_0,\ldots,u_{K-1}$ and $v_0,\ldots,v_{K-1}$. This observation leads to a NUFFT-II algorithm that is quasi-optimal for any set of samples and frequencies (see Section 2), and similar observations lead to our NUFFT-I and NUFFT-III algorithms. The major computational cost of our NUFFTs is $K$ FFTs that can be performed in parallel, where $K$ is an adaptively selected integer that depends on the working precision $0<\epsilon<1$ and the distribution of the samples and frequencies. This allows us to reduce the cost of our NUFFTs (by reducing $K$) when the working precision is loosened, the samples are near-equispaced, or the frequencies are close to being integers. In particular, if any of our NUFFT codes are given equispaced samples and integer frequencies, then $K = 1$ and our implementation reduces to a single FFT. By always computing the NUDFT via $K$ FFTs, we are able to leverage the efficient FFTW library, which has an implementation of the FFT that adapts to individual computer architectures [13]. Our algorithm relies on FFTs that are of the same size as the original NUFFT, and we automatically exploit the distribution of the samples and frequencies if they happen to be quasi-uniform for extra computational speed.
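To make the low rank observation (1.3) concrete, the following sketch (our own illustration; the grid size, perturbation, and tolerance are arbitrary choices, not values from the paper) builds the NUDFT-II and DFT matrices for a perturbed equispaced grid and checks that their entrywise quotient is numerically of very low rank:

```python
import numpy as np

# Form A = F2 ./ F (Hadamard division) for a perturbed equispaced grid and
# measure its numerical rank via the singular value decay. The parameters
# below (N, gamma, tolerance) are illustrative choices.
rng = np.random.default_rng(0)
N = 256
gamma = 1 / 8                                  # perturbation parameter, as in (2.1)
j = np.arange(N)
x = j / N + gamma / N * rng.uniform(-1, 1, N)  # |x_j - j/N| <= gamma/N
F2 = np.exp(-2j * np.pi * np.outer(x, j))      # NUDFT-II matrix
F = np.exp(-2j * np.pi * np.outer(j, j) / N)   # DFT matrix
A = F2 / F                                     # A_jk = e^{-2 pi i (x_j - j/N) k}

sv = np.linalg.svd(A, compute_uv=False)
numerical_rank = int(np.sum(sv / sv[0] > 1e-14))
print(numerical_rank, N)                       # the rank is far smaller than N
```

The same experiment with a smaller `gamma` gives an even smaller numerical rank, which is what our adaptive choice of $K$ exploits.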
There are many other NUFFTs in the literature based on various ideas, such as discrete convolutions and oversampling [10, 15, 23], min-max interpolation [12], oversampling and interpolation [7], and a Taylor-based approach [1]. The Taylor-based approach results in an easily implementable algorithm, but it is avoided in practice because it is numerically unstable [18, Ex. 3.10]. For the last two decades, discrete convolutions and oversampling have been preferred. The transforms that we develop here are convenient and simple while being numerically stable. We benchmark our algorithms against the Julia implementation of the NFFT software [22] to demonstrate that our proposed algorithm is competitive with existing state-of-the-art approaches.
The paper is structured as follows. In Section 2, we derive the NUFFT-II algorithm by first assuming that the nonuniform samples are a perturbed equispaced grid (see Section 2.1) before generalizing to any distribution of samples (see Section 2.2). In Section 3 we extend the algorithm to derive a NUFFT-I, NUFFT-III, and inverse transforms. In Section 4, we describe the two-dimensional analogue of our NUFFT-II.
2. The nonuniform fast Fourier transform of type II. In this section, we describe an $\mathcal{O}(N\log N\log(1/\epsilon)/\log\log(1/\epsilon))$ algorithm to compute the NUDFT-II of size $N$ (see (1.2)) with a working precision of $0<\epsilon<1$. We begin by making the simplifying assumption that the samples $x_0,\ldots,x_{N-1}$ are nearly equispaced before describing the general algorithm.
2.1. Samples are a perturbed equispaced grid. Suppose that the samples $x_0,\ldots,x_{N-1}$ are distributed such that there exists a parameter $0\le\gamma\le 1/2$ satisfying
$$|x_j - j/N| \le \frac{\gamma}{N}, \qquad 0\le j\le N-1. \qquad (2.1)$$
This assumption guarantees that a closest equispaced point to $x_j$ is $j/N$, which simplifies the description of our algorithm. Using the fact that $\omega_k = k$ for $0\le k\le N-1$ and properties of the exponential function, we can factor the entries of $\tilde{F}_2$ as
$$e^{-2\pi i x_j k} = e^{-2\pi i (x_j - j/N)k}\, e^{-2\pi i jk/N}, \qquad 0\le j,k\le N-1, \qquad (2.2)$$
which shows that the $(j,k)$ entry of $\tilde{F}_2$ can be written as a complex number multiplied by the $(j,k)$ entry of the DFT matrix. The expression in (2.2) gives us the following matrix decomposition:
$$\tilde{F}_2 = A \circ F, \qquad A_{jk} = e^{-2\pi i (x_j - j/N)k}, \qquad (2.3)$$
where '$\circ$' is the Hadamard product. The observation in (1.3) is equivalent to the matrix $A$ being well-approximated by a low rank matrix, so that
$$\tilde{F}_2 c = (A\circ F)c \approx (A_K\circ F)c = \sum_{r=0}^{K-1} D_{u_r} F D_{v_r} c, \qquad A_K = \sum_{r=0}^{K-1} u_r v_r^\top. \qquad (2.4)$$
Therefore, an approximation to $\tilde{F}_2 c$ can be computed in $\mathcal{O}(KN\log N)$ operations via the FFT, as each term in the sum in (2.4) involves diagonal matrices and the DFT matrix. Moreover, each matrix-vector product in the sum can be computed independently and the resulting vectors added together afterwards.
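The building block of (2.4), namely that a rank 1 Hadamard term satisfies $(uv^\top\circ F)c = D_u F D_v c$ and therefore costs a single FFT, can be checked directly (a generic illustration with random vectors):

```python
import numpy as np

# Verify that ((u v^T) .* F) c equals D_u F D_v c, where the product with the
# DFT matrix F is computed by one FFT. Vectors here are random illustrations.
rng = np.random.default_rng(1)
N = 128
j = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(j, j) / N)            # DFT matrix
u = rng.standard_normal(N) + 1j * rng.standard_normal(N)
v = rng.standard_normal(N) + 1j * rng.standard_normal(N)
c = rng.standard_normal(N) + 1j * rng.standard_normal(N)

lhs = (np.outer(u, v) * F) @ c    # dense Hadamard-product form
rhs = u * np.fft.fft(v * c)       # diagonally-scaled FFT: D_u F D_v c
assert np.allclose(lhs, rhs)
```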
All that remains is to show that $A$ can in fact be well-approximated by a low rank matrix or, equivalently, that $K$ is relatively small, and to construct a low rank approximation $A_K$ to $A$. We cannot use the singular value decomposition for this because it costs $\mathcal{O}(N^3)$ operations, which would dominate the algorithmic complexity of the NUFFT-II. Instead, we note that $A$ can be viewed as a matrix obtained by sampling $e^{-ixy}$ at points in $[-\gamma,\gamma]\times[0,2\pi]$, and we construct a low rank approximation via an approximation of the function $e^{-ixy}$.
2.1.1. Low rank approximation via Taylor expansions. A natural way to construct a low rank approximation to $A$ is via Taylor expansion, by exploiting the fact that $(x_j - j/N)k$ is relatively small for $0\le j,k\le N-1$. The NUFFT developed in this way is equivalent to [1] (without oversampling) and is numerically unstable. In this direction, consider the Taylor expansion $e^{-x} = 1 - x + x^2/2 - x^3/6 + \cdots$ about $x = 0$. Applying this Taylor series to each entry of $A$, we find that, for $0\le j,k\le N-1$,
$$A_{jk} = e^{-2\pi i (x_j - j/N)k} \approx \sum_{r=0}^{K-1} \frac{\big(-2\pi i (x_j - j/N)k\big)^r}{r!}, \qquad (2.5)$$
where the expansion is truncated after $K$ terms to deliver an approximation. Now, if we let $x = (x_0,\ldots,x_{N-1})^\top$, $e = (0, 1/N, \ldots, (N-1)/N)^\top$, and $\omega = (0, 1, \ldots, N-1)^\top$, then (2.5) can be applied to each entry of $A$ to find that
$$A = \exp\!\big(-2\pi i (x - e)\,\omega^\top\big) \approx A_K = \sum_{r=0}^{K-1} \frac{(-2\pi i)^r}{r!}\,\big((x - e)\,\omega^\top\big)^{r}.$$
Here, the notation $x y^\top$ denotes a rank 1 matrix, $\exp(x y^\top)$ is the matrix formed by applying the exponential function entry-by-entry to $x y^\top$, and $(x y^\top)^r$ is the entry-by-entry $r$th power of $x y^\top$. Since $|2\pi i (x_j - j/N)k| \le 2\pi\gamma$ for $0\le j,k\le N-1$, error estimates for the truncated Taylor expansion of $e^{-x}$ for $|x|\le 2\pi\gamma$ show that $\|A - A_K\|_{\max}\le\epsilon$ for $K = \mathcal{O}(\log(1/\epsilon))$ [1], where $\|\cdot\|_{\max}$ is the absolute maximum matrix entry. To avoid overflow issues, one should take the vectors $u_r = (N(x - e))^r$ and $v_r = (-i)^r (2\pi\omega/N)^r/r!$ for $0\le r\le K-1$ in (2.4). Unfortunately, we observe that the Taylor-based approach is numerically unstable (even with modest oversampling), in agreement with the experiments in [18, Ex. 3.10]. This is because for moderate $K$ ($\ge 7$) the matrix $A_K$ is constructed by evaluating high-degree monomial powers. For this reason, the NUFFT-II described in [1] is seldom used. We must construct the matrix $A_K$ in a different way.
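For completeness, here is a sketch of the Taylor construction (2.5) with the scaled vectors $u_r$ and $v_r$ above (the grid and the choice $K = 10$ are our own); it reproduces $A$ to Taylor accuracy for small $K$, but it is not the construction we recommend:

```python
import numpy as np

# Taylor-based rank-K approximation of A_jk = e^{-2 pi i (x_j - j/N) k}:
# A_K = sum_r u_r v_r^T with u_r = (N(x - e))^r, v_r = (-i)^r (2 pi w / N)^r / r!.
# Illustrative parameters; the paper discards this construction as unstable.
rng = np.random.default_rng(2)
N = 64
gamma = 1 / 8
jj = np.arange(N)
x = jj / N + gamma / N * rng.uniform(-1, 1, N)
w = jj.astype(float)                            # integer frequencies w_k = k
A = np.exp(-2j * np.pi * np.outer(x - jj / N, w))

K = 10
A_K = np.zeros((N, N), dtype=complex)
factorial = 1.0
for r in range(K):
    u_r = (N * (x - jj / N)) ** r
    v_r = (-1j) ** r * (2 * np.pi * w / N) ** r / factorial
    A_K += np.outer(u_r, v_r)
    factorial *= r + 1

taylor_err = np.max(np.abs(A - A_K))
print(taylor_err)                               # small for this modest K
```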
2.1.2. Low rank approximation via Chebyshev expansions. One can often stabilize high-degree Taylor expansions by replacing them with Chebyshev expansions. We do that now.
For an integer $p\ge 0$, the Chebyshev polynomial of degree $p$ is given by $T_p(x) = \cos(p\cos^{-1}x)$ on $x\in[-1,1]$, and the set $\{T_0, T_1, \ldots, T_{K-1}\}$ is an orthogonal basis for the space of polynomials of degree at most $K-1$ with respect to the weight function $(1 - x^2)^{-1/2}$ on $[-1,1]$. We can use a Chebyshev series to represent nonperiodic functions, in the same way that a Fourier series can represent periodic functions [27].
In Theorem A.2 in the Appendix, we derive a low rank approximation for $A$ by using Chebyshev expansions. If $\gamma = 0$, then $A$ is the matrix of all ones and the low rank approximation is trivial. If $\gamma > 0$, then for $0<\epsilon<1$ we find an integer $K$ (see (2.7)) and a matrix $A_K$ such that $\|A - A_K\|_{\max}\le\epsilon$, where $\|\cdot\|_{\max}$ denotes the absolute maximum matrix entry. The matrix $A_K$ is defined by (see Theorem A.2)
$$A_K = {\sum_{p=0}^{K-1}}'\,{\sum_{r=0}^{K-1}}'\, a_{pr}\, T_p\!\big(N(x - e)/\gamma\big)\, T_r\!\big(2\omega^\top/N - 1^\top\big), \qquad (2.6)$$
where $1$ is the $N\times 1$ vector of ones and the primes on the summation signs indicate that the first terms are halved. The coefficients $a_{pr}$ for $0\le p,r\le K-1$ are known explicitly in terms of the Bessel function $J_\nu(z)$ of parameter $\nu$ at $z$ (see (A.1)) [21, Chap. 10]. Here in (2.6), $\exp(x)$ and $T_s(x)$ denote the exponential and the degree $s$ Chebyshev polynomial evaluated at each entry of $x$ to form another vector, respectively. The expansion in (2.6) provides us with a rank $K$ matrix that approximates $A$, and $A = \lim_{K\to\infty} A_K$. From the convergence properties of Chebyshev expansions, an explicit upper bound on $\|A - A_K\|_{\max}$ is known for each fixed $K$. The vectors $T_p(N(x - e)/\gamma)$ are evaluated via the three-term recurrence relation for Chebyshev polynomials [21, Tab. 18.9.1]. This requires a total of $\mathcal{O}(K^2 N)$ operations. Strictly, this cost should be included in the final complexity of the NUFFT-II, but we do not include it because it is part of the "planning stage" (see Section 2.3).
The integer $K$ is chosen as (see Theorem A.2)
$$K = \max\left\{3, \left\lceil 5\gamma\, e^{W(\log(140/\epsilon)/(5\gamma))}\right\rceil\right\}, \qquad (2.7)$$
where $W(x)$ is the Lambert W function for $x\ge 0$. By asymptotic approximations of $W(x)$ as $x\to\infty$, we find that $K = \mathcal{O}(\log(1/\epsilon)/\log\log(1/\epsilon))$ as $\epsilon\to 0$ [21, (4.13.10)] and hence $\tilde{F}_2 c$ can be computed in a total of $\mathcal{O}(N\log N\log(1/\epsilon)/\log\log(1/\epsilon))$ operations using (2.4). It is relatively common in practice to have perturbed equispaced samples, so we always compute the parameter $0\le\gamma\le 1/2$ in (2.1) in order to select the smallest possible integer $K$ with $\|A - A_K\|_{\max}\le\epsilon$. In our implementation of the NUFFT-II, we do not use the formula for $K$ in (2.7) because it is only asymptotically sharp and the constants are not tight. Instead, in (2.6) we use the values of $K$ given in Table 2.1, which are selected from empirical observations. In particular, in double precision we use at most $K = 16$, corresponding to a NUFFT-II cost of approximately 16 FFTs of size $N$.
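As a numerical sanity check on (2.7) (our own experiment, with illustrative parameters, using the best rank-$K$ approximation from a truncated SVD as a stand-in for the Chebyshev construction and a hand-rolled Newton iteration for the Lambert W function):

```python
import numpy as np

# Compute K from (2.7) and check that a rank-K truncation of A already has a
# small approximation error. Parameters (N, gamma, eps) are illustrative.
def lambert_w(z, steps=50):
    """Newton iteration for the principal branch of w e^w = z, z > 0."""
    w = np.log(1.0 + z)                  # reasonable starting guess
    for _ in range(steps):
        w -= (w * np.exp(w) - z) / ((1.0 + w) * np.exp(w))
    return w

rng = np.random.default_rng(3)
N = 256
gamma = 1 / 4
eps = 1e-10
jj = np.arange(N)
x = jj / N + gamma / N * rng.uniform(-1, 1, N)
A = np.exp(-2j * np.pi * np.outer(x - jj / N, jj))

K = max(3, int(np.ceil(5 * gamma * np.exp(lambert_w(np.log(140 / eps) / (5 * gamma))))))
U, sv, Vh = np.linalg.svd(A)
A_K = (U[:, :K] * sv[:K]) @ Vh[:K, :]    # best rank-K approximation
err = np.max(np.abs(A - A_K))
print(K, err)
```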
In practice, a working precision of $\epsilon\approx 2.2\times 10^{-16}$ is not always needed, so we also adaptively select the integer $K$ based on that parameter. For example, with a working precision of $9.8\times 10^{-4}$, the NUFFT-II costs at most seven FFTs of size $N$.

2.2. Arbitrarily distributed samples.
Suppose that the samples x 0 , . . . , x N −1 in (1.2) are arbitrarily distributed real numbers. The properties of the complex exponential, x → e −2πixk for 0 ≤ k ≤ N − 1, allow us to assume, without loss of generality, that the samples are in the interval [0, 1); otherwise, they can be translated to that interval using periodicity. For convenience, in this section we assume that x 0 , . . . , x N −1 ∈ [0, 1), though our implementation does not have this restriction. In this general setting, the observation in (1.3) is no longer valid because the samples are arbitrarily distributed.
Instead, define a sequence $s_0,\ldots,s_{N-1}$ that takes values from $\{0,\ldots,N\}$ and is defined so that $s_j/N$ is the closest node to $x_j$ from an equispaced grid of size $N+1$ (ties can be broken arbitrarily). Since each $x_j$ is a distance of at most $1/(2N)$ from these equispaced nodes, we have
$$|x_j - s_j/N| \le \frac{1}{2N}, \qquad 0\le j\le N-1. \qquad (2.8)$$
[Figure caption: Here $0\le x_0\le x_1\le\cdots\le x_{N-1}<1$, but the samples do not necessarily need to be ordered. Since $s_7 = 8$, the sample $x_7$ is later assigned to the equispaced point at $0$ (see (2.9)).]
If $s_j = N$, then we must reassign $x_j$ because the uniform DFT does not contain a sample at $s_j/N = 1$. Using the periodicity of the complex exponential, i.e., $e^{-2\pi i N k/N} = e^{-2\pi i\cdot 0\cdot k/N} = 1$, we assign $x_j$ to the equispaced node at $0$. This can be done simply by defining another sequence $t_0,\ldots,t_{N-1}$, which takes values from $\{0,\ldots,N-1\}$ and is given by
$$t_j = \begin{cases} s_j, & 0\le s_j\le N-1,\\ 0, & s_j = N, \end{cases} \qquad 0\le j\le N-1. \qquad (2.9)$$
In practice, one can easily compute the vector $t$ directly from $x$ since
$$t = \mathrm{mod}(\mathrm{round}(Nx), N),$$
where $\mathrm{round}(a)$ rounds each entry of $a$ to the nearest integer and $\mathrm{mod}(a, N)$ is the modulo-$N$ operation on each entry of $a$.
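The vectors $s$ and $t$ take two lines to compute; the toy example below uses $N = 8$ with one sample close to $1$ so that, as described above, it is folded to the equispaced point at $0$ (the sample values themselves are made up for illustration):

```python
import numpy as np

# s_j/N is the closest node to x_j from an equispaced grid of size N + 1;
# t_j folds s_j = N back to 0 using periodicity. Sample values are made up.
N = 8
x = np.array([0.01, 0.12, 0.26, 0.40, 0.49, 0.63, 0.77, 0.97])
s = np.round(N * x).astype(int)          # values in {0, ..., N}
t = np.mod(s, N)                         # values in {0, ..., N-1}
print(s, t)                              # s_7 = 8 is reassigned to t_7 = 0
```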
From the properties of the exponential function and the definition of $t_0,\ldots,t_{N-1}$, we find that
$$e^{-2\pi i x_j k} = e^{-2\pi i (x_j - s_j/N)k}\, e^{-2\pi i t_j k/N}, \qquad 0\le j,k\le N-1. \qquad (2.10)$$
This means that the $(j,k)$ entry of $\tilde{F}_2$ can be expressed as the product of $e^{-2\pi i (x_j - s_j/N)k}$ and the $(t_j, k)$ entry of $F$ for $0\le j,k\le N-1$. Equivalently, by setting $A_{jk} = e^{-2\pi i (x_j - s_j/N)k}$ for $0\le j,k\le N-1$, we can write (2.10) as the matrix decomposition
$$\tilde{F}_2 = A \circ F(t,:),$$
where $F(t,:)$ denotes the matrix formed by extracting the rows indexed by $t_0,\ldots,t_{N-1}$ from the DFT matrix $F$.
Since $N(x_j - s_j/N)\in[-1/2,1/2]$ and $k/N\in[0,1]$, we find that $A$ can be well-approximated by a low rank matrix using the same idea as in Section 2.1.2. This leads to a rank $K$ approximation $A_K$ to $A$ of the form (2.6), with the vector $s = (s_0,\ldots,s_{N-1})^\top$ playing the role of $Ne$. Here, $\|A - A_K\|_{\max}\le\epsilon$ for some $0<\epsilon<1$, and $K$ is the value in (2.7) with $\gamma = 1/2$. In summary, the matrix-vector product $\tilde{F}_2 c$ can be approximately computed with a working accuracy of $0<\epsilon<1$ via the approximation
$$\tilde{F}_2 c \approx \sum_{r=0}^{K-1} D_{u_r} F(t,:)\, D_{v_r} c. \qquad (2.11)$$
• Oversampling: One can embed the NUDFT-II into a transform of size $M\ge N$, for which every sample is within $1/(2M)$ of an equispaced grid of size $M$. Naively, since our algorithm is numerically stable without oversampling, it would seem that oversampling is never beneficial for us. For example, in double precision, if $M = 2N$, then 13 FFTs of size $2N$ are required (see Table 2.1) instead of 16 FFTs of size $N$. In practice, it is a little more complicated, as one may benefit from selecting an integer $N\le M<2N$ that has a convenient prime factorization for the FFT [8]. We have not explored this possibility yet.
• Vectorization: One can vectorize the FFTs in (2.11) by computing $\tilde{F}_2 c$ in two steps:
$$X_k = F\,(v_k\circ c), \quad 0\le k\le K-1, \qquad \tilde{F}_2 c \approx \sum_{k=0}^{K-1} u_k\circ X(t, k),$$
where $X_k$ denotes the $k$th column of $X$, so that all $K$ FFTs are performed in a single batched call. In the programming language Julia [6] this can be implemented as a one-liner (see our implementation), where the variable t is the vector $t$.
• Planning the transform: Most implementations of fast transforms have a planning stage [13], where ancillary quantities that do not depend on the entries of $c$ are computed. This stage may also involve memory allocation and the finalization of recursion details [13]. For our NUFFT-II, the planning stage consists of computing $\gamma$, $t$, and $K$, planning the FFTs [13], and computing the vectors $u_0,\ldots,u_{K-1}$, $v_0,\ldots,v_{K-1}$ for the low rank approximation $A_K$. These quantities and data structures are stored in memory so that the NUFFT-II is computationally faster. After the planning stage of our NUFFT-II, there is an online stage, where the transform is essentially the nufft2(c) call above. Planning a NUFFT-II is particularly important when the matrix-vector product with $\tilde{F}_2$ is desired for many vectors.
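Putting Section 2.2 together, the following end-to-end sketch evaluates the sum of diagonally-scaled FFTs in (2.11). For brevity, we obtain the rank-$K$ factors of $A$ from a truncated SVD rather than the Chebyshev construction of Section 2.1.2 (the SVD costs $\mathcal{O}(N^3)$, so this is an illustration of the structure of the transform, not of its speed):

```python
import numpy as np

# End-to-end illustration of (2.11): F2 c ~ sum_r D_{u_r} F(t,:) D_{v_r} c,
# with the low rank factors of A taken from a truncated SVD for simplicity.
rng = np.random.default_rng(4)
N = 200
x = np.sort(rng.uniform(0, 1, N))                  # arbitrary samples in [0, 1)
c = rng.standard_normal(N) + 1j * rng.standard_normal(N)
k = np.arange(N)

s = np.round(N * x).astype(int)                    # closest grid indices, in {0, ..., N}
t = np.mod(s, N)                                   # folded indices, in {0, ..., N-1}
A = np.exp(-2j * np.pi * np.outer(x - s / N, k))   # |N(x_j - s_j/N)| <= 1/2

U, sv, Vh = np.linalg.svd(A)
K = int(np.sum(sv / sv[0] > 1e-14))                # numerical rank of A
f = np.zeros(N, dtype=complex)
for r in range(K):
    u_r = sv[r] * U[:, r]
    v_r = Vh[r, :]
    f += u_r * np.fft.fft(v_r * c)[t]              # D_{u_r} F(t,:) D_{v_r} c

f_exact = np.exp(-2j * np.pi * np.outer(x, k)) @ c # direct O(N^2) NUDFT-II
rel_err = np.max(np.abs(f - f_exact)) / np.linalg.norm(c)
print(K, rel_err)
```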

Numerical results.
We have two different implementations of the transforms in this paper: (1) a MATLAB implementation, where the NUFFT-II transform is accessible via the chebfun.nufft command in Chebfun [9], and (2) a Julia implementation, which is publicly available via the nufft2 command in the FastTransforms.jl package [24]. Since the dominating computational cost of our transforms is FFTs, which are computed via the FFTW library [13], the cost of our algorithms is approximately the same in MATLAB and Julia. Recall that there are two stages of the transform: (1) a planning stage, in which ancillary quantities are computed (see Section 2.3), and (2) an online stage, in which the transform is given the vector $c$ in (1.2) and the desired vector $f$ is computed. When the same NUFFT-II transform is applied to multiple vectors, the planning stage is performed only once, while the online stage is executed for every new vector. Figure 2.2 shows the execution times of the NUFFT-II in the planning stage (left) and the online stage (right). The online stage of the NUFFT-II costs approximately 16 FFTs in double precision, as expected from Table 2.1, and takes approximately 8 seconds to compute the transform when $N$ is 16 million. Figure 2.2 also shows that our NUFFT-II is competitive with the Julia implementation of the NFFT software [17]. Figure 2.3 (left) demonstrates the execution times of the online stage of our NUFFT-II for samples that are perturbed equispaced grids with $\gamma = 1/2$, $\gamma = 1/8$, $\gamma = 1/32$, and $\gamma = 0$ (see (2.1)). For definiteness, we chose the samples to be the so-called worst grid for each $\gamma$ in the NUFFT-II (see [2, Sec. 3.3.1]). We see that the NUFFT-II is more computationally efficient when the samples are closer to an equispaced grid, as expected from the values of $K$ in Table 2.1.
Our NUFFT-II relies on a matrix approximation, namely, the approximation of the matrix $A$ in (2.3) by a low rank approximation $A_K$. Therefore, if $f_{\mathrm{exact}}$ is the vector calculated from $f_{\mathrm{exact}} = \tilde{F}_2 c = (A\circ F)c$, then our algorithm calculates the approximation $f_{\mathrm{approx}} = (A_K\circ F)c$. The incurred error can be simply bounded as follows:
$$\|f_{\mathrm{exact}} - f_{\mathrm{approx}}\|_2 = \|((A - A_K)\circ F)c\|_2 \le \|(A - A_K)\circ F\|_F\,\|c\|_2 \le \epsilon N\|c\|_2,$$
where $\|\cdot\|_F$ denotes the matrix Frobenius norm and the last inequality follows from the fact that $\|A - A_K\|_{\max}\le\epsilon$ and $\|F\|_F = N$.
3. Other nonuniform fast Fourier transforms. We now extend the NUFFT-II to the other transforms. The NUDFT-I has uniform samples and noninteger frequencies. It is equivalent to evaluating a generalized Fourier series at equispaced points, i.e., computing the matrix-vector product $\tilde{F}_1 c$, where
$$(\tilde{F}_1)_{jk} = e^{-2\pi i (j/N)\omega_k}, \qquad 0\le j,k\le N-1.$$
For this transform, we immediately find that
$$(\tilde{F}_1)_{jk} = e^{-2\pi i (\omega_k/N) j},$$
where the frequencies $\omega_0/N,\ldots,\omega_{N-1}/N$ act as nonequispaced samples in a NUDFT-II. Therefore, the NUDFT-I matrix is equivalent to a transposed NUDFT-II matrix. Since the transpose of a sum of matrices is the sum of the individual terms transposed, (2.11) immediately leads to our NUFFT-I:
$$\tilde{F}_1 c = \tilde{F}_2^\top c \approx \sum_{r=0}^{K-1} D_{v_r}\, F(t,:)^\top D_{u_r} c,$$
where $\tilde{F}_2$, $t$, $u_r$, and $v_r$ are constructed from the samples $\omega_0/N,\ldots,\omega_{N-1}/N$. Therefore, $\tilde{F}_1 c$ can be computed in $\mathcal{O}(N\log N\log(1/\epsilon)/\log\log(1/\epsilon))$ operations: since $F^\top = F$, each term costs one FFT of size $N$ (equivalently, one can use the relationship $F^* c = N F^{-1} c$ and the inverse FFT).
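The transposition argument can be checked numerically: since the DFT matrix is symmetric, $(D_u F D_v)^\top = D_v F D_u$, so each transposed diagonally-scaled term in the NUFFT-I still costs a single FFT (a generic illustration with random vectors):

```python
import numpy as np

# Check (D_u F D_v)^T c = D_v F D_u c, using F^T = F for the DFT matrix,
# so the transposed diagonally-scaled term is again one FFT.
rng = np.random.default_rng(5)
N = 128
j = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(j, j) / N)
u = rng.standard_normal(N)
v = rng.standard_normal(N)
c = rng.standard_normal(N) + 1j * rng.standard_normal(N)

lhs = (np.diag(u) @ F @ np.diag(v)).T @ c
rhs = v * np.fft.fft(u * c)
assert np.allclose(lhs, rhs)
```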
Since $|N(x_j - s_j/N)|\le 1/2$ for $0\le j\le N-1$ and $\omega_k/N\in[0,1)$ for $0\le k\le N-1$, we know from Theorem A.2 that $A$ can be approximated by a rank $K$ matrix $A_K$ such that $\|A - A_K\|_{\max}\le\epsilon$ and $K = \mathcal{O}(\log(1/\epsilon)/\log\log(1/\epsilon))$, where $0<\epsilon<1$ is a working precision. Moreover, the matrix $B$ is of rank at most 2: row $j$ of $B$ equals $1^\top$ when $s_j = t_j$ and equals the row vector $(e^{-2\pi i\omega_0},\ldots,e^{-2\pi i\omega_{N-1}})$ when $s_j\ne t_j$ (i.e., when $s_j = N$ and $t_j = 0$), so $B$ is the sum of at most two rank 1 matrices, where $1$ is the $N\times 1$ column vector of ones. Therefore, $A\circ B$ can be well-approximated by a rank $\mathcal{O}(K)$ matrix and hence $\tilde{F}_3 c$ can be computed in $\mathcal{O}(KN\log N)$ operations.
In double precision, the cost of this NUFFT-III is at most 32(= 16 × 2) NUFFT-I's or, equivalently, 512(= 32 × 16) FFTs of size N . These FFTs can all still be performed in parallel. In the case when the sequences s 0 , . . . , s N −1 and t 0 , . . . , t N −1 are the same, which often occurs (see (2.9)), the matrix B is the matrix of all ones. In this situation, A • B = A and the cost of the NUFFT-III is reduced by a factor of 2.
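The rank-2 structure of $B$ is easy to see numerically: every row of $B$ is either the all-ones row or the single row associated with the folded samples, so at most two distinct rows occur (the frequencies and the folded set below are random illustrations):

```python
import numpy as np

# B has at most two distinct rows: 1^T for samples with s_j = t_j, and the
# row (e^{-2 pi i w_0}, ..., e^{-2 pi i w_{N-1}}) for folded samples.
rng = np.random.default_rng(6)
N = 64
w = np.sort(rng.uniform(0, N, N))          # noninteger frequencies
folded = rng.uniform(0, 1, N) < 0.2        # rows with s_j != t_j (illustrative)
B = np.ones((N, N), dtype=complex)
B[folded, :] = np.exp(-2j * np.pi * w)     # broadcast the second row type

rank_B = np.linalg.matrix_rank(B)
print(rank_B)                              # at most 2
```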
This transform is available in the chebfun.nufft command in Chebfun [9].

Inverse nonuniform fast Fourier transforms.
In the NUFFT-I, -II, and -III, severely nonequispaced samples or noninteger frequencies were not a numerical issue and the parameter γ in (2.1) only mildly affected the computational cost of the transform. For the inverse transforms, nonuniform samples or frequencies are far more detrimental in terms of both accuracy and computational cost.
The inverse NUDFT-II requires that the linear system $\tilde{F}_2 c = f$ be solved for the vector $c$, where $\tilde{F}_2$ is given in (1.2). Here, we assume that the samples $x_0,\ldots,x_{N-1}$ are perturbed equispaced samples with $0\le\gamma<1/2$ (see (2.1)) to ensure that $\tilde{F}_2^{-1}$ exists. Since we have a fast matrix-vector product for $\tilde{F}_2$ (see Section 2), one naturally tries a variety of Krylov subspace methods. After experimenting with several of them, we advocate the following approach based on the conjugate gradient method (CG).
The matrix $\tilde{F}_2$ is not Hermitian positive definite, so the conjugate gradient method cannot be immediately applied. Instead, we use the conjugate gradient method on the normal equations: $\tilde{F}_2^*\tilde{F}_2 c = \tilde{F}_2^* f$. By considering the $(j,k)$ entry of $\tilde{F}_2^*\tilde{F}_2$, we find that it only depends on the value of $j - k$:
$$\big(\tilde{F}_2^*\tilde{F}_2\big)_{jk} = \sum_{\ell=0}^{N-1} e^{2\pi i x_\ell (j-k)}, \qquad 0\le j,k\le N-1.$$
Hence, $\tilde{F}_2^*\tilde{F}_2$ is a Toeplitz matrix, i.e., a matrix that is constant along each diagonal, as noted previously in [10]. Therefore, a matrix-vector product with $\tilde{F}_2^*\tilde{F}_2$ can be computed using a fast Toeplitz multiply, costing just one FFT and one inverse FFT of size $2N$ [14, Sec. 4.7.7]. (Note that the first column and first row of $\tilde{F}_2^*\tilde{F}_2$ determine each other by conjugate symmetry, and the first column can be obtained via the relation $\tilde{F}_2^*\tilde{F}_2 e_1$, where $e_1$ is the first canonical vector.) Let the number of conjugate gradient iterations be denoted by $R_{\mathrm{cg}}$. Since CG requires one matrix-vector product per iteration, the inverse transform costs the same as $2R_{\mathrm{cg}}$ FFTs of size $2N$, ignoring the $\mathcal{O}(K)$ FFTs needed to compute $\tilde{F}_2^* f$ and the calculation of the eigenvalues of a circulant matrix (see [14, Sec. 4.7.7]). Therefore, this iterative method leads to an inverse NUFFT-II with a computational cost of $\mathcal{O}(R_{\mathrm{cg}} N\log N)$ operations, which is quasi-optimal provided that $R_{\mathrm{cg}}$ does not grow too quickly with $N$. Empirically, $R_{\mathrm{cg}}$ is observed to be small and, perhaps, bounded with $N$ when $0<\gamma<1/4$. When the samples are uniformly spaced, $\tilde{F}_2 = F$ and $R_{\mathrm{cg}} = 1$. As the perturbation parameter $\gamma$ is increased from $0$ to $1/2$, the condition number of $\tilde{F}_2$ (and hence $R_{\mathrm{cg}}$) can increase without bound. For example, when $\gamma = 1/2$, the samples may not be distinct and so $\tilde{F}_2^{-1}$ may not exist. To fully understand the algorithmic complexity of our inverse NUFFT-II, we need to bound $R_{\mathrm{cg}}$. One can do this immediately if a bound on the condition number of $\tilde{F}_2$ is known.
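The inverse transform described above can be sketched as follows (the grid, tolerances, and the plain CG loop are our own choices; dense matrices appear only to set up the problem and validate the answer):

```python
import numpy as np

# Sketch of the inverse NUFFT-II: solve the normal equations
# F2* F2 c = F2* f by conjugate gradients, where T = F2* F2 is Hermitian
# Toeplitz, so each multiply with T costs one FFT and one inverse FFT of
# size 2N via circulant embedding.
rng = np.random.default_rng(7)
N = 256
gamma = 1 / 8
jj = np.arange(N)
x = jj / N + gamma / N * rng.uniform(-1, 1, N)    # perturbed equispaced grid
F2 = np.exp(-2j * np.pi * np.outer(x, jj))
c_true = rng.standard_normal(N) + 1j * rng.standard_normal(N)
f = F2 @ c_true

# First column of T = F2* F2 and the eigenvalues of its circulant embedding.
col = np.exp(2j * np.pi * np.outer(np.arange(N), x)).sum(axis=1)
lam = np.fft.fft(np.concatenate([col, [0], np.conj(col[:0:-1])]))

def toeplitz_mv(v):
    """Multiply T v in O(N log N) via the length-2N circulant embedding."""
    w = np.fft.ifft(lam * np.fft.fft(np.concatenate([v, np.zeros(N)])))
    return w[:N]

def cg(matvec, b, tol=1e-13, maxiter=500):
    """Plain conjugate gradients for a Hermitian positive definite operator."""
    x0 = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = np.vdot(r, r).real
    it = 0
    while np.sqrt(rs) > tol * np.linalg.norm(b) and it < maxiter:
        Ap = matvec(p)
        alpha = rs / np.vdot(p, Ap).real
        x0 = x0 + alpha * p
        r = r - alpha * Ap
        rs_new = np.vdot(r, r).real
        p = r + (rs_new / rs) * p
        rs = rs_new
        it += 1
    return x0, it

c_sol, iters = cg(toeplitz_mv, F2.conj().T @ f)
err = np.linalg.norm(c_sol - c_true) / np.linalg.norm(c_true)
print(iters, err)
```

In a genuinely fast inverse transform, the products with $\tilde{F}_2$ and $\tilde{F}_2^*$ would of course also be evaluated via the NUFFT-II rather than with dense matrices.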
The recent theoretical work on the Lebesgue constant for trigonometric interpolation at nonequispaced points in [2, 3] is potentially helpful for bounding the condition number of $\tilde{F}_2$; however, we have not yet been able to derive such a bound in terms of $\gamma$.

4. The two-dimensional nonuniform fast Fourier transform of type II.
Given an $m\times n$ matrix of Fourier coefficients $C\in\mathbb{C}^{m\times n}$ and nonuniform samples $(x_0,y_0),\ldots,(x_{N-1},y_{N-1})\in\mathbb{R}^2$, the two-dimensional NUDFT-II is the task of computing the following vector:
$$f_j = \sum_{k_1=0}^{n-1}\sum_{k_2=0}^{m-1} C_{k_2 k_1}\, e^{-2\pi i (x_j k_1 + y_j k_2)}, \qquad 0\le j\le N-1. \qquad (4.1)$$
Naively, this requires $\mathcal{O}(Nmn)$ operations, since there are $N$ sums with each sum containing $mn$ terms. Here, we describe an algorithm that requires only $\mathcal{O}(mn(\log n + \log m) + N)$ operations.
[Figure caption: The sequence $(s^x_0,s^y_0),\ldots,(s^x_{N-1},s^y_{N-1})\in\mathbb{N}^2$ is defined so that $(s^x_j/n, s^y_j/m)$ is the closest $m\times n$ equispaced grid point to $(x_j,y_j)$ for $0\le j\le N-1$. If $(x_j,y_j)$ is assigned to one of the grid points denoted by the unfilled black circles, i.e., either $s^x_j = n$ or $s^y_j = m$ or both, then the sample is reassigned using the periodicity of the complex exponential function (see (4.3)).]
It is helpful to start by reviewing the uniform two-dimensional FFT, which computes the vector $f\in\mathbb{C}^{mn\times 1}$ (by default $N = mn$) such that
$$f_{mt+s} = \sum_{k_1=0}^{n-1}\sum_{k_2=0}^{m-1} C_{k_2 k_1}\, e^{-2\pi i (t k_1/n + s k_2/m)}, \qquad 0\le s\le m-1,\quad 0\le t\le n-1. \qquad (4.2)$$
The samples in (4.2) lie on the $m\times n$ equispaced tensor grid $(s/m, t/n)$ for $0\le s\le m-1$ and $0\le t\le n-1$. In the Julia language, the vector $f$ in (4.2) can be computed by the command fft(C)[:] in $\mathcal{O}(mn(\log m + \log n))$ operations. As in Section 2.2, we first define a sequence $(s^x_0,s^y_0),\ldots,(s^x_{N-1},s^y_{N-1})\in\mathbb{N}^2$ such that $(s^x_j/n, s^y_j/m)$ is the closest point from an $m\times n$ equispaced grid to $(x_j,y_j)$ for $0\le j\le N-1$. By definition, we have
$$|x_j - s^x_j/n| \le \frac{1}{2n}, \qquad |y_j - s^y_j/m| \le \frac{1}{2m}, \qquad 0\le j\le N-1.$$
If $s^x_j = n$ or $s^y_j = m$ for any $0\le j\le N-1$, the equispaced sample $(s^x_j/n, s^y_j/m)$ does not appear in the two-dimensional FFT in (4.2). Analogous to the sequence $t_0,\ldots,t_{N-1}$ in (2.9), we reassign the sample $(x_j,y_j)$ using the periodicity of the complex exponential function. That is, we define a new sequence $(t^x_0,t^y_0),\ldots,(t^x_{N-1},t^y_{N-1})$ such that
$$t^x_j = \mathrm{mod}(s^x_j, n), \qquad t^y_j = \mathrm{mod}(s^y_j, m), \qquad 0\le j\le N-1. \qquad (4.3)$$
Using these two sequences, we can rewrite (4.1) as
$$f_j = \sum_{k_1=0}^{n-1}\sum_{k_2=0}^{m-1} C_{k_2 k_1}\, A^x_{jk_1}\, A^y_{jk_2}\, e^{-2\pi i (k_1 t^x_j/n + k_2 t^y_j/m)}, \qquad A^x_{jk_1} = e^{-2\pi i (x_j - s^x_j/n)k_1}, \quad A^y_{jk_2} = e^{-2\pi i (y_j - s^y_j/m)k_2}. \qquad (4.4)$$
Here, $A^x\in\mathbb{C}^{N\times n}$ and $A^y\in\mathbb{C}^{N\times m}$ are matrices that can be well-approximated by low rank matrices because $n|x_j - s^x_j/n|\le 1/2$, $k_1/n\in[0,1]$, $m|y_j - s^y_j/m|\le 1/2$, and $k_2/m\in[0,1]$. Using the ideas in Section 2.1.2, we can construct vectors such that
$$A^x \approx \sum_{r_1=0}^{K_1-1} u^x_{r_1}\big(v^x_{r_1}\big)^\top, \qquad A^y \approx \sum_{r_2=0}^{K_2-1} u^y_{r_2}\big(v^y_{r_2}\big)^\top.$$
In double precision, $K_1$ and $K_2$ are both at most 16 (see Table 2.1). Moreover, we note that $e^{-2\pi i (k_1 t^x_j/n + k_2 t^y_j/m)}$ is closely related to the complex exponential function in the uniform two-dimensional DFT in (4.2).
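The uniform two-dimensional transform (4.2) can be checked against a direct double sum. Note that NumPy's flatten() is row-major, whereas Julia's C[:] is column-major, so the ordering in this small check is $f[sn+t]$:

```python
import numpy as np

# Check that a 2D FFT of the coefficient matrix C evaluates the double sum
# on the equispaced tensor grid. NumPy flattens row-major, so the output
# ordering here is f[s*n + t], unlike Julia's column-major fft(C)[:].
rng = np.random.default_rng(8)
m, n = 6, 5
C = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))

f = np.fft.fft2(C).flatten()   # f[s*n + t], 0 <= s <= m-1, 0 <= t <= n-1
f_direct = np.array([
    sum(C[k2, k1] * np.exp(-2j * np.pi * (s * k2 / m + t * k1 / n))
        for k2 in range(m) for k1 in range(n))
    for s in range(m) for t in range(n)
])
assert np.allclose(f, f_direct)
```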
Substituting the low rank representations for $A^x$ and $A^y$ into (4.4), absorbing the sums over $k_1$ and $k_2$ into matrix-matrix products, and using the fact that $(uv^\top)\circ C = D_u C D_v$, we find that (4.1) can be expressed as
$$f_j \approx \sum_{r_1=0}^{K_1-1}\sum_{r_2=0}^{K_2-1} \big(u^x_{r_1}\big)_j \big(u^y_{r_2}\big)_j \left[\mathrm{vec}\!\left(F_m D_{v^y_{r_2}}\, C\, D_{v^x_{r_1}} F_n^\top\right)\right]_{m t^x_j + t^y_j}, \qquad (4.5)$$
where $F_m$ and $F_n$ denote the DFT matrices of sizes $m$ and $n$, respectively. Here, $[\mathrm{vec}(A)]_{m t^x_j + t^y_j}$ denotes the $(m t^x_j + t^y_j)$th entry of the vector $\mathrm{vec}(A)$. The sum in (4.5) leads to a quasi-optimal complexity transform for the two-dimensional NUFFT-II. There are $K_1 K_2$ terms in (4.5), each requiring an $m\times n$ two-dimensional FFT with a diagonally-scaled coefficient matrix $C$. Moreover, since each term adds together $N\times 1$ vectors, the total cost of the transform is $\mathcal{O}(K_1 K_2\, mn(\log m + \log n) + N)$ operations. With an explicit dependence on the working accuracy $0<\epsilon<1$, this becomes $\mathcal{O}\big(mn(\log m + \log n)\big(\log(1/\epsilon)/\log\log(1/\epsilon)\big)^2 + N\big)$ operations.
The cost of the transform can be moderately reduced by noting that $C D_{v^x_{r_1}} F_n^\top$ in (4.5) does not depend on $r_2$ and can be computed just once for each $0\le r_1\le K_1-1$. This reduces the cost to $\mathcal{O}(K_1 K_2\, mn\log m + K_1\, mn\log n + N)$ operations. For implementations of this two-dimensional transform, see the chebfun.nufft2 command in Chebfun [9] and the nufft2d command in FastTransforms.jl [24]. There are two other types of two-dimensional NUFFTs, which can be derived with similar ideas, as can multidimensional NUFFTs.
Lemma A.1. Let $0<\epsilon<1$ be a working precision and $\gamma>0$. The following holds for all $(x,y)\in[-\gamma,\gamma]\times[0,2\pi]$:
$$\left| e^{-ixy} - {\sum_{p=0}^{K-1}}'\,{\sum_{q=0}^{K-1}}'\, a_{pq}\, T_p(x/\gamma)\, T_q(y/\pi - 1) \right| \le \epsilon,$$
where $T_p$ is the degree $p$ Chebyshev polynomial, the $a_{pq}$ coefficients are given in (A.1), and the primes on the summation signs indicate that the first terms are halved. Here, the integer $K$ satisfies
$$K = \max\left\{3, \left\lceil 5\gamma\, e^{W(\log(140/\epsilon)/(5\gamma))}\right\rceil\right\},$$
where $W$ is the Lambert W function. Proof. Using [21, (10.14.1) and (10.14.7)], we find that
$$|a_{pr}| \le 4\left(\frac{e\gamma\pi}{p+r}\right)^{(p+r)/2}, \qquad \max(p,r)\ge 1.$$
Therefore, by setting $s = p + r$, we can bound the error by summing these coefficient bounds over the discarded indices; the resulting estimate uses $e\gamma\pi \le e\pi/2 \le 5$. By solving for $K\ge 3$ such that $\sum_{p=K}^{\infty}\sum_{r=K}^{\infty} |a_{pr}| \le \epsilon$, we find that we can take $K$ to be
$$K = \max\left\{3, \left\lceil 5\gamma\, e^{W(\log(140/\epsilon)/(5\gamma))}\right\rceil\right\} = \mathcal{O}\!\left(\frac{\log(1/\epsilon)}{\log\log(1/\epsilon)}\right), \qquad \epsilon\to 0,$$
where $W(x)$ is the Lambert W function. The asymptotic approximation for the lower bound on $K$ as $\epsilon\to 0$ is derived from the asymptotic expansion for $W(x)$ as $x\to\infty$ [21, (4.13.10)].
We now evaluate the truncated Chebyshev expansion constructed in Lemma A.1 to derive a rank $K$ approximation to the matrix $A$ in (2.3). We make the additional restriction that $\gamma\le 1/2$ in the statement of the theorem below because we do not construct low rank approximations to $A$ when $\gamma > 1/2$ (see Section 2.2).
Theorem A.2. Let $N\ge 1$ be an integer, $0<\epsilon<1$, and $x_0,\ldots,x_{N-1}$ samples such that (2.1) holds with $0<\gamma\le 1/2$. Consider the $N\times N$ matrix
$$A_{jk} = e^{-2\pi i (x_j - j/N)\omega_k}, \qquad 0\le j,k\le N-1,$$
where $\omega_0,\ldots,\omega_{N-1}\in[0,N]$. Then, there exists a rank $K$ matrix $A_K$ such that $\|A - A_K\|_{\max}\le\epsilon$, where
$$K = \max\left\{3, \left\lceil 5\gamma\, e^{W(\log(140/\epsilon)/(5\gamma))}\right\rceil\right\} = \mathcal{O}\!\left(\frac{\log(1/\epsilon)}{\log\log(1/\epsilon)}\right), \qquad \epsilon\to 0.$$
Proof. Let $e = (0, 1/N, \ldots, (N-1)/N)^\top$ and $\omega = (\omega_0,\ldots,\omega_{N-1})^\top$. Then, $A = \exp(-2\pi i (x - e)\omega^\top)$, where the exponential function is applied entry-by-entry to its matrix argument. Since the entries of $N(x - e)$ are in $[-\gamma,\gamma]$ and the entries of $2\pi\omega/N$ are in $[0,2\pi]$, we can apply Lemma A.1 to each entry of $A$. We conclude that for $K = \max\{3, \lceil 5\gamma e^{W(\log(140/\epsilon)/(5\gamma))}\rceil\}$ we have $\|A - A_K\|_{\max}\le\epsilon$ with
$$A_K = {\sum_{p=0}^{K-1}}'\,{\sum_{r=0}^{K-1}}'\, a_{pr}\, T_p\!\big(N(x - e)/\gamma\big)\, T_r\!\big(2\omega^\top/N - 1^\top\big), \qquad (A.2)$$
where $T_p(x)$ is the degree $p$ Chebyshev polynomial, the coefficients $a_{pr}$ are given in (A.1), $1$ is the $N\times 1$ column vector of ones, and the primes on the summation signs indicate that the first terms are halved.
Each term in the double sum in (A.2) is a rank 1 matrix, so it may look like $A_K$ has rank at most $K^2$; however, by appropriately grouping the terms as
$$A_K = {\sum_{p=0}^{K-1}}'\; \underbrace{T_p\!\big(N(x - e)/\gamma\big)}_{N\times 1\ \text{vector}}\;\underbrace{\left({\sum_{r=0}^{K-1}}'\, a_{pr}\, T_r\!\big(2\omega^\top/N - 1^\top\big)\right)}_{1\times N\ \text{vector}},$$
we see that $A_K$ is a sum of $K$ rank 1 matrices, and we conclude that $A_K$ is a matrix of rank at most $K$, as required. The asymptotic order of $K$ given in the statement of the theorem comes from the asymptotic expansion of $W(x)$ as $x\to\infty$ [21, (4.13.10)].