
Thanks for tracking that one down.

Summary: the FFTW people claim they weren’t trying to be malicious, and after DJB called them out publicly (when his private emails went nowhere) they eventually removed djbfft from their benchmark summary pages entirely.

* * *

For many use cases it is possible (and helpful) to stick to power-of-2 FFT sizes, because these are significantly faster than arbitrary sizes. In some cases it is also fine to skip re-ordering the terms at one end of the transform, because the order only needs to be consistent internally, not with some external spec; this can save a fair amount of time too.
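A minimal sketch of that skip-the-reordering trick (assuming a power-of-2 length and plain NumPy, with no claims of speed): a decimation-in-frequency FFT that leaves its output in bit-reversed order, paired with a decimation-in-time inverse that consumes that order directly, so neither direction ever pays for a bit-reversal pass. This is the usual shape of a convolution pipeline where only internal consistency matters.

```python
import numpy as np

def fft_dif(x):
    # Radix-2 decimation-in-frequency FFT, output in BIT-REVERSED order.
    # The usual final re-ordering pass is deliberately skipped.
    a = np.asarray(x, dtype=complex).copy()
    n = a.size
    assert n and n & (n - 1) == 0, "length must be a power of 2"
    span = n
    while span > 1:
        half = span // 2
        w = np.exp(-2j * np.pi * np.arange(half) / span)
        for s in range(0, n, span):
            top, bot = a[s:s + half], a[s + half:s + span]
            a[s:s + half], a[s + half:s + span] = top + bot, (top - bot) * w
        span = half
    return a

def ifft_dit(a):
    # Matching radix-2 decimation-in-time inverse: consumes bit-reversed
    # input and produces natural order, again with no re-ordering pass.
    y = np.asarray(a, dtype=complex).copy()
    n = y.size
    span = 2
    while span <= n:
        half = span // 2
        w = np.exp(2j * np.pi * np.arange(half) / span)
        for s in range(0, n, span):
            top, bot = y[s:s + half], y[s + half:s + span] * w
            y[s:s + half], y[s + half:s + span] = top + bot, top - bot
        span *= 2
    return y / n
```

Round-tripping fft_dif through ifft_dit recovers the input exactly, and permuting the bit-reversed output recovers the natural-order DFT, so the missing reorder pass costs nothing when both ends agree on the convention.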

But in other contexts it is of course important to handle arbitrary lengths, including those that aren’t products of small primes, and to keep the input/output terms in a canonical order. FFTW tries to be reasonably fast in every such context by doing fancy runtime code specialization.

I’m curious if anyone has a page with updated benchmarks for various sizes of FFTs on modern CPUs and maybe GPUs, testing whichever FFT libraries are still being actively maintained (including those that limit themselves to power-of-2 sizes). CPUs have changed some in the last 10–15 years (wider SIMD, caches matter more, etc.), and GPUs are a whole lot faster now.
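In lieu of such a page, this is the shape of micro-benchmark I mean, using NumPy’s bundled pocketfft as a stand-in (a real comparison would link against FFTW/MKL and use each library’s planner; the specific sizes here are just illustrative):

```python
import timeit
import numpy as np

def bench_fft(n, repeats=20):
    # Best-of-N wall time for one complex FFT of length n.
    x = np.random.default_rng(0).standard_normal(n) + 0j
    np.fft.fft(x)  # warm-up call so planning/caching isn't timed
    return min(timeit.repeat(lambda: np.fft.fft(x), number=1, repeat=repeats))

# A power of 2, a smooth composite, and a prime (primes typically
# fall back to Bluestein's algorithm and run noticeably slower).
for n in (4096, 3600, 4093):
    print(f"n={n:5d}: {bench_fft(n) * 1e6:8.1f} us")
```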

The FFTW benchmark site still doesn’t have anything more recent than ~2005 as far as I can tell.

* * *

One more thing: if anyone knows a simple and fast WebAssembly FFT implementation (ideally with benchmarks shown for common web browsers / CPUs) I would love to know about it. Just power-of-2 sizes would be fine.



Prime95 https://www.mersenne.org/download/ has what is probably the fastest CPU FFT implementation; specialized and carefully tuned to Intel & AMD CPUs, with AVX assembly.

My project GpuOwl https://github.com/preda/gpuowl/ has a GPU double-precision FFT convolution implementation which is faster (and simpler IMO) than clFFT.


Curious at what size the GPU becomes faster than the CPU? (taking into account memory transfer time)
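A back-of-envelope model of that crossover; every constant below is an assumption, not a measurement (effective PCIe bandwidth, sustained FFT flop rates, fixed launch/transfer latency), so plug in your own hardware’s numbers:

```python
import math

# All constants are illustrative assumptions -- replace with measured values.
PCIE_BW      = 12e9    # bytes/s, effective host<->device bandwidth
CPU_FLOPS    = 5e9     # flop/s a tuned CPU FFT sustains
GPU_FLOPS    = 200e9   # flop/s a tuned GPU FFT sustains
OVERHEAD     = 50e-6   # s, fixed kernel-launch + transfer latency
BYTES_PER_PT = 16      # complex double

def fft_flops(n):
    # Classic operation-count estimate for a radix-2 FFT.
    return 5 * n * math.log2(n)

def cpu_time(n):
    return fft_flops(n) / CPU_FLOPS

def gpu_time(n):
    # Compute time plus moving the data to the GPU and back.
    transfer = 2 * n * BYTES_PER_PT / PCIE_BW
    return fft_flops(n) / GPU_FLOPS + transfer + OVERHEAD

def crossover():
    # Smallest power-of-2 size where the GPU, transfers included, wins.
    n = 2
    while gpu_time(n) >= cpu_time(n):
        n *= 2
    return n

print(f"model crossover: n = {crossover()}")
```

Under these made-up constants the crossover lands in the thousands of points; the fixed launch/transfer latency dominates at small sizes, which is why batching many small FFTs into one GPU call changes the answer dramatically.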


>I’m curious if anyone has a page with updated benchmarks for various sizes of FFTs on modern CPUs and maybe GPUs, testing whichever FFT libraries are still being actively maintained (including those that limit themselves to power-of-2 sizes)

This is the most up-to-date one that I know of online (2014):

https://github.com/JuliaLang/julia/pull/6193#issuecomment-61...

If the array is large enough for a GPU, then yes, a GPU version would rip it to shreds (not shown, but known). You can see the MKL advantage in there, but FFTW still performs pretty admirably. Everyone pretty much uses MKL or FFTW these days. I don't think there's as much development in this space because these are all pretty well-optimized now. Multi-process parallelism for these is supposedly not difficult and can use a very similar strategy, but GPUs really require some different approaches, and that's where the development is these days. Also, a lot of mathematicians have more recently focused on advancing spectral discretizations beyond FFTs; for example see Chebfun and ApproxFun, and things like FastTransforms.jl, where the interesting thing is non-uniformly spaced FFTs and similar transforms.


> for example see Chebfun and [Julia port] ApproxFun

These are great tools that deserve to be widely known and adopted. They are designed for working with functions defined on an interval rather than on a periodic domain, and for doing a bunch more with them. http://www.chebfun.org
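To make the interval-vs-periodic point concrete, here is a tiny sketch of the core idea using only NumPy's Chebyshev module (Chebfun and ApproxFun additionally choose the degree adaptively and layer a whole calculus toolbox on top; the fixed degree 20 here is an arbitrary choice for the sketch):

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

# Approximate a smooth function on an interval by interpolating it at
# Chebyshev points -- the basic object these libraries are built on.
f = Chebyshev.interpolate(np.exp, deg=20, domain=[-1, 1])

# The interpolant supports fast evaluation anywhere on the interval.
xs = np.linspace(-1, 1, 7)
err = np.max(np.abs(f(xs) - np.exp(xs)))
print(f"max error at sample points: {err:.2e}")

# Calculus happens on the coefficients, not the samples, which is the
# "work with the expansion" idea Chebfun popularized.
df = f.deriv()
```

For a function this smooth, degree 20 already hits machine precision; the adaptive libraries essentially keep doubling the degree until the trailing Chebyshev coefficients fall below a tolerance.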

Ironically in the context of this discussion, these tools could get a whole heck of a lot faster with some dedicated effort from some low-level optimization experts.


Popular GPU FFT implementations are cuFFT [1] and clFFT [2]. IIRC cuFFT performs very well.

[1] https://developer.nvidia.com/cufft

[2] https://clmathlibraries.github.io/clFFT/



