"the horsing around to get the data in and out" seems to be the key factor. An analysis of BLAS libraries' performance across several architectures [1] showed that GPU-based computation only caught up with tuned CPU implementations like Goto BLAS once matrix dimensions were well up into the thousands. That's just one example, but it suggests there is a fair bit of overhead in getting the data to and from the GPU.
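
To make the trade-off concrete, here's a rough back-of-the-envelope model in Python. It is only a sketch: the throughput, bus bandwidth, and fixed-overhead numbers are hypothetical placeholders I picked for illustration, not figures from [1], and the function names are my own. The idea is simply that a matrix multiply costs about 2n^3 floating-point operations but only moves about 3n^2 elements, so the transfer cost dominates for small n and fades for large n.

```python
def gemm_times(n, cpu_gflops=100.0, gpu_gflops=1000.0,
               bus_gb_per_s=8.0, fixed_overhead_s=1e-4, bytes_per_elem=8):
    """Estimate (cpu_seconds, gpu_seconds) for an n x n matrix multiply.

    All default parameters are hypothetical, chosen only to illustrate
    the shape of the trade-off.
    """
    flops = 2.0 * n ** 3                      # multiply-adds in a dense GEMM
    cpu = flops / (cpu_gflops * 1e9)          # CPU works on data already in RAM
    # GPU path: copy A and B in, copy C out, plus a fixed setup cost.
    transfer = 3 * n * n * bytes_per_elem / (bus_gb_per_s * 1e9)
    gpu = fixed_overhead_s + transfer + flops / (gpu_gflops * 1e9)
    return cpu, gpu


def break_even(max_n=100_000, **params):
    """Return the smallest n where the modelled GPU time is no worse than
    the CPU time, or None if that never happens up to max_n."""
    for n in range(1, max_n + 1):
        cpu, gpu = gemm_times(n, **params)
        if gpu <= cpu:
            return n
    return None


if __name__ == "__main__":
    for n in (10, 100, 1000, 5000):
        cpu, gpu = gemm_times(n)
        print(f"n={n:5d}  cpu={cpu:.6f}s  gpu={gpu:.6f}s")
    print("break-even n:", break_even())
```

Where the crossover lands depends entirely on the numbers you plug in, so don't read the output as a prediction for any real hardware; the point is just that the crossover exists and moves upward as the bus gets slower or the fixed overhead gets larger.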

[1] http://dirk.eddelbuettel.com/blog/code/gcbd/