Getting stuff *out* of cache is also worthwhile in some cases. I had message passing between cores, and apparently the CPU couldn't tell that my chance of touching a message slot again, once it was populated and sent, was zero. Adding a clflush after each send significantly reduced my number of cache misses.
This is an aspect of multi-core programming. As the programmer, you must be painfully aware of which cache lines you're touching and which cores touch them. Having two or more cores touch the same cache line causes expensive core-to-core synchronization. Things are easier in a message passing system like yours, where the other core is clearly the sender and can flush the line after sending. In kernel space you can disable caching or mark memory regions write-combining or write-through as needed, but this isn't really available in userspace.
This can happen in very unintuitive places. For example, in Java an array's length field lives in the object header, which shares a cache line with the beginning of the array data. If you're partitioning an array-processing routine across many cores, workers shouldn't keep reading `array.length` while other cores are writing the first elements, or you'll pay for that synchronization on every read; read the length once and pass each worker its own bounds.