Summary: the FFTW people claim they weren’t trying to be malicious, and after DJB called them out publicly (when his private emails didn’t work) they eventually entirely removed djbfft from their benchmark summary pages.
* * *
For many use cases, it is possible (and helpful) to stick to power-of-two FFT sizes, because these are significantly faster than arbitrary sizes. In some cases it is also fine to skip re-ordering the terms at one end, because the order only needs to be consistent internally, not with some external spec; this can save a good deal of time too.
But for other contexts it is of course important to handle any arbitrary lengths including those that aren’t products of small primes, and input/output terms in a canonical order. FFTW tries to be reasonably fast in every possible context by doing fancy runtime code specialization.
I’m curious if anyone has a page with updated benchmarks for various sizes of FFTs on modern CPUs and maybe GPUs, testing whatever FFT libraries out there are still being actively maintained (including those which limit themselves to power-of-2 sizes). CPUs have changed some in the last 10–15 years (more SIMD, more important to optimize for cache, etc.), and GPUs are a whole lot faster now.
The FFTW benchmark site still doesn’t have anything more recent than ~2005 as far as I can tell.
* * *
One more thing: if anyone knows a simple and fast WebAssembly FFT implementation (ideally with benchmarks shown for common web browsers / CPUs) I would love to know about it. Just power-of-2 sizes would be fine.
Prime95 https://www.mersenne.org/download/ has what is probably the fastest CPU FFT implementation; specialized and carefully tuned to Intel & AMD CPUs, with AVX assembly.
My project GpuOwl https://github.com/preda/gpuowl/ has a GPU double-precision FFT convolution implementation which is faster (and simpler IMO) than clFFT.
>I’m curious if anyone has a page with updated benchmarks for various sizes of FFTs on modern CPUs and maybe GPUs, testing whatever FFT libraries out there are still being actively maintained (including those which limit themselves to power-of-2 sizes)
This is the most up-to-date one that I know of online (2014).
If the array is large enough for a GPU, then yes, a GPU version would rip it to shreds (not shown, but known). You can see the MKL advantage in there, but FFTW still performs pretty admirably. Everyone pretty much uses MKL or FFTW these days. I don't think there's as much development in this space because these are all pretty well-optimized now. Multi-process parallelism for these is supposedly not difficult and can use a very similar strategy, but GPUs really require some different approaches, and that's where the development is these days. Also, a lot of mathematicians have had a more recent focus on advancing the realm of spectral discretizations beyond FFTs; for example, see Chebfun and ApproxFun, and things like FastTransforms.jl, where the interesting thing is non-uniformly spaced FFTs and similar transforms.
> for example see Chebfun and [Julia port] ApproxFun
These are great technologies which should be widely known and adopted. They are designed for working with functions defined on an interval rather than a periodic domain, and for doing a bunch more with them. http://www.chebfun.org
Ironically in the context of this discussion, these tools could get a whole heck of a lot faster with some dedicated effort from some low-level optimization experts.
The author of the answer is Jon Harrop, a well-known former OCaml evangelist (and author of an awesome book, "OCaml for Scientists" [1]). He sometimes goes a bit too far in explaining the virtues of OCaml...
I carefully studied the source code of FFTW a few years ago and read a few of Frigo's papers (e.g., [2]). The purpose of FFTW is to apply direct/inverse Fourier transforms to vectors of N elements, and it does so by employing a number of algorithms to split the data into many smaller chunks and then applying the transform to each of these sub-chunks. (This is basically the idea of the FFT.)
When a chunk is small enough (e.g., 10 elements), it can be worthwhile to stop splitting it into smaller subpieces and instead apply a direct brute-force formula. This is where OCaml comes into play. FFTW's authors have written OCaml snippets to compute the Fourier transform in a number of ways. This code is written using custom operators, so that the snippet is not compiled down to machine code but instead kept as an AST. An optimizing compiler (written in OCaml) is then run on the AST to perform a number of optimizations on the snippet, and finally output C code is generated from the optimized AST. Among the optimizations there are some classical ones (e.g., constant folding); others are domain-specific, because they employ some property or symmetry of the Fourier transform. The C code that is produced as output is tailored to some specific situation, e.g., the direct transform of 13 purely real numbers.
The algorithm that takes N numbers and splits them into chunks is written in C, as is the code that decides when to stop the recursive splitting and which OCaml-generated snippet is best suited for each chunk. Moreover, C is used for all the functions that allocate and handle the many data structures used by the library. Therefore, it is a bit of a stretch to say that FFTW is written in OCaml, although its most interesting part (the optimizing compiler) indeed is. This is why Frigo's paper [2] deals mainly with the OCaml part.
From the article: "Note that some people (such as Frank) get confused because FFTW is commonly distributed in the form of precompiled C code. That is not the source code. Most of the C code is generated by OCaml source code."
Since the linked explanation assumes you know what it is and what it stands for: Fastest Fourier Transform in the West, https://en.wikipedia.org/wiki/FFTW
FFTW certainly isn't simply written in OCaml. Apart from higher-level C, at its base it has the sort of low-level micro-architecture-specific kernels you'd expect (using C intrinsics rather than assembler in this case).
The problem I've always had with FFTs is that it's incredibly difficult to write an optimal FFT for X size, Y direction, Z variability (X=2^4-2^64, Y=FWD,REV,BIDIR, Z=2^4-2^(log2(X)))
Does the metaprogramming element of FFTW work well, or does it boil down to "If X=2^4, Y=FWD, Z=X: build24FWDX()"?
Based on another comment by someone who studied the source, it sounds like the OCaml code generates C code for the various optimized kernels. I wonder what the C kernel-chooser code looks like.
FWIW, proper hygienic macros and scientifically focused array types make this sort of meta-programming optimization in Julia kinda fun.
Though just using type dispatching / multi-methods might be all that’s needed. It all makes me want to have a reason to write that kind of optimized code... For example if you can lift the size, variable, and variability into types you could match on your config roughly like: “fft(X :: Pwr2_64, Y :: FWD, Dir :: REV, Var :: BIDIR, Z) = 2^4-2^(log2(X))”.
This was back before Julia could do auto-SIMD optimizations, and it got quite close to FFTW without SIMD. He mentioned somewhere that modern Julia should be able to get very close to FFTW with SIMD, given that Julia now has better inlining heuristics, interprocedural optimizations, automatic SIMD, etc. (2014 was the stone age for Julia). If you read the rest of the thread, most of the discussion was about a standard library system so FFTW could be moved out. With Julia v1.0 FFTW is now in the standard library, which gives room for a pure Julia FFT to be standard. So this should get revived soon and we will have a beautiful generic FFTW algorithm. I can't wait :)
Yeah, I meant to say it's no longer in the standard library. It's not given special privileges anymore, so we are free to iterate alternatives to it. I'm not sure how to edit and fix that post (there's another typo in there :( )
> run-time profiling to choose the fastest codelets (which is largely affected by the target architecture).
What's the quick overview of how this run-time profiling works?
The value of such a system seems immense but I only have seen it mentioned with reference to fftw. Have there been efforts to generalize that process to realtime DSP programming?
When you ask FFTW to construct a plan, it benchmarks a variety of options and uses the fastest one.
> FFTW_MEASURE tells FFTW to find an optimized plan by actually computing several FFTs and measuring their execution time. Depending on your machine, this can take some time (often a few seconds). FFTW_MEASURE is the default planning option.
> FFTW_PATIENT is like FFTW_MEASURE, but considers a wider range of algorithms and often produces a “more optimal” plan (especially for large transforms), but at the expense of several times longer planning time (especially for large transforms).
> FFTW_EXHAUSTIVE is like FFTW_PATIENT, but considers an even wider range of algorithms, including many that we think are unlikely to be fast, to produce the most optimal plan but with a substantially increased planning time.
But for now, say I'm on Debian and I install the fftw package. Could the packager add a post-install script that runs the runtime tests and caches the results for the library to use, so that apps which leverage fftw don't have to redo the measurement on each instantiation?
Maybe something has changed in the two decades since then? Their benchmark page seems not to have changed in at least a decade.
I would guess that folks making use of GPUs could probably get quite a speedup on this type of numerical workload in 2018.