Summary: the FFTW people claim they weren’t trying to be malicious, and after DJB called them out publicly (when his private emails didn’t work) they eventually entirely removed djbfft from their benchmark summary pages.
* * *
For many use cases, it is possible (and helpful) to stick to power-of-two FFT sizes, because these are significantly faster than arbitrary sizes. In some cases it is also fine to skip re-ordering the terms at one end, because the order only needs to be consistent internally, not with some external spec; this can save a good deal of time too.
But for other contexts it is of course important to handle any arbitrary lengths including those that aren’t products of small primes, and input/output terms in a canonical order. FFTW tries to be reasonably fast in every possible context by doing fancy runtime code specialization.
I’m curious if anyone has a page with updated benchmarks for various sizes of FFTs on modern CPUs and maybe GPUs, testing whatever FFT libraries out there are still being actively maintained (including those which limit themselves to power-of-2 sizes). CPUs have changed some in the last 10–15 years (more SIMD, more important to optimize for cache, etc.), and GPUs are a whole lot faster now.
The FFTW benchmark site still doesn’t have anything more recent than ~2005 as far as I can tell.
* * *
One more thing: if anyone knows a simple and fast WebAssembly FFT implementation (ideally with benchmarks shown for common web browsers / CPUs) I would love to know about it. Just power-of-2 sizes would be fine.
Prime95 https://www.mersenne.org/download/ has what is probably the fastest CPU FFT implementation; specialized and carefully tuned to Intel & AMD CPUs, with AVX assembly.
My project GpuOwl https://github.com/preda/gpuowl/ has a GPU double-precision FFT convolution implementation which is faster (and simpler IMO) than clFFT.
>I’m curious if anyone has a page with updated benchmarks for various sizes of FFTs on modern CPUs and maybe GPUs, testing whatever FFT libraries out there are still being actively maintained (including those which limit themselves to power-of-2 sizes)
This is the most up-to-date one that I know of online (2014).
If the array is large enough for a GPU, then yes, a GPU version would rip it to shreds (not shown, but known). You can see the MKL advantage in there, but FFTW still performs pretty admirably. Everyone pretty much uses MKL or FFTW these days. I don't think there's as much development in this space because these are all pretty well-optimized now. Multi-process parallelism for these is supposedly not difficult and can use a very similar strategy, but GPUs really require some different approaches, and that's where the development is these days. Also, a lot of mathematicians have had a more recent focus on advancing the realm of spectral discretizations beyond FFTs; for example, see Chebfun and ApproxFun, and things like FastTransforms.jl, where the interesting thing is non-uniformly spaced FFTs and similar transforms.
> for example see Chebfun and [Julia port] ApproxFun
These are great technologies which should be widely known and adopted. They are designed for working with functions defined on an interval rather than a periodic domain, and for doing a bunch more with them. http://www.chebfun.org
Ironically in the context of this discussion, these tools could get a whole heck of a lot faster with some dedicated effort from some low-level optimization experts.
The author of the answer is Jon Harrop, a well-known former OCaml evangelist (and author of an awesome book, "OCaml for Scientists" [1]). He sometimes goes a bit too far in explaining the virtues of OCaml...
I carefully studied the source code of FFTW a few years ago and read a few of Frigo's papers (e.g., [2]). The purpose of FFTW is to apply direct/inverse Fourier transforms to vectors of N elements, and it does so by employing a number of algorithms to split the data into many smaller chunks and then applying the transform to each of these sub-chunks. (This is basically the idea of the FFT.)
When a chunk is small enough (e.g., 10 elements), it can be worthwhile to stop splitting it into smaller subpieces and instead apply a direct brute-force formula. This is where OCaml comes into play. FFTW's authors have written OCaml snippets to compute the Fourier transform in a number of ways. This code is written using custom operators, so that the snippet is not compiled down to machine code but instead kept as an AST. An optimizing compiler (written in OCaml) is then run on the AST to perform a number of optimizations on the snippet, and finally output C code is generated from the optimized AST. Among the optimizations there are some classical ones (e.g., constant folding); others are domain-specific, because they employ some property or symmetry of the Fourier transform. The C code that is produced as output is tailored to some specific situation, e.g., the direct transform of 13 purely real numbers.
The algorithm that takes N numbers and splits them into chunks is written in C, as is the code that decides when to stop the recursive splitting and which OCaml-generated snippet is best suited for each chunk. Moreover, C is used for all the functions that allocate and handle the many data structures used by the library. Therefore, it is a bit of a stretch to say that FFTW is written in OCaml, although its most interesting part (the optimizing compiler) indeed is. This is why Frigo's paper [2] deals mainly with the OCaml part.
From the article: "Note that some people (such as Frank) get confused because FFTW is commonly distributed in the form of precompiled C code. That is not the source code. Most of the C code is generated by OCaml source code."
Since the linked explanation assumes you know what it is and what it stands for: Fastest Fourier Transform in the West, https://en.wikipedia.org/wiki/FFTW
FFTW certainly isn't simply written in OCaml. Apart from higher-level C, at its base it has the sort of low-level micro-architecture-specific kernels you'd expect (using C intrinsics rather than assembler in this case).
The problem I've always had with FFTs is that it's incredibly difficult to write an optimal FFT for X size, Y direction, Z variability (X=2^4-2^64, Y=FWD,REV,BIDIR, Z=2^4-2^(log2(X)))
Does the metaprogramming element of FFTW work well, or does it boil down to "If X=2^4, Y=FWD, Z=X: build24FWDX()"?
Based on another comment by someone who studied the source, it sounds like the OCaml code generates C code for the various optimized kernels. I wonder what the C kernel-chooser code looks like.
FWIW, proper hygienic macros and scientifically focused array types make this sort of meta-programming optimization in Julia kinda fun.
Though just using type dispatching / multi-methods might be all that’s needed. It all makes me want to have a reason to write that kind of optimized code... For example if you can lift the size, variable, and variability into types you could match on your config roughly like: “fft(X :: Pwr2_64, Y :: FWD, Dir :: REV, Var :: BIDIR, Z) = 2^4-2^(log2(X))”.
This was back before Julia could do auto-SIMD optimizations, and it got quite close to FFTW without SIMD. He mentioned somewhere that modern Julia should be able to get very close to FFTW with SIMD, given that Julia now has better inlining heuristics, interprocedural optimizations, automatic SIMD, etc. (2014 was the stone age for Julia). If you read the rest of the thread, most of the discussion was about a standard library system so FFTW could be moved out. With Julia v1.0 FFTW is now in the standard library, which gives room for a pure Julia FFT to be standard. So this should get revived soon and we will have a beautiful generic FFTW algorithm. I can't wait :)
Yeah, I meant to say it's no longer in the standard library. It's not given special privileges anymore, so we are free to iterate alternatives to it. I'm not sure how to edit and fix that post (there's another typo in there :( )
> run-time profiling to choose the fastest codelets (which is largely affected by the target architecture).
What's the quick overview of how this run-time profiling works?
The value of such a system seems immense but I only have seen it mentioned with reference to fftw. Have there been efforts to generalize that process to realtime DSP programming?
When you ask FFTW to construct a plan, it benchmarks a variety of options and uses the fastest one.
> FFTW_MEASURE tells FFTW to find an optimized plan by actually computing several FFTs and measuring their execution time. Depending on your machine, this can take some time (often a few seconds). FFTW_MEASURE is the default planning option.
> FFTW_PATIENT is like FFTW_MEASURE, but considers a wider range of algorithms and often produces a “more optimal” plan (especially for large transforms), but at the expense of several times longer planning time (especially for large transforms).
> FFTW_EXHAUSTIVE is like FFTW_PATIENT, but considers an even wider range of algorithms, including many that we think are unlikely to be fast, to produce the most optimal plan but with a substantially increased planning time.
But for now, say I'm on Debian and I install the fftw package. Could the packager add a post-install script that runs the runtime tests and caches the results for the library to use, so that apps which leverage fftw don't have to redo the measurement on each instantiation?
Maybe something has changed in the two decades since then? Their benchmark page seems not to have changed in at least a decade.
I would guess that folks making use of GPUs could probably get quite a speedup on this type of numerical workload in 2018.