That's true, it is hard to predict. But I'm really interested in the most optimistic prediction for how my particular problem would perform on the GPGPU. In this case, I don't think an LRU cache will help much, since the access pattern is uniform (every piece of data has to be examined against every proposed feature). However, you do remind me that a load-ahead style of caching might help: if the needed data is loaded into cache with some synchronization to guarantee that every currently running kernel uses that piece of data for the feature it is examining, the performance gain may be achievable. Actually, I'm going to spend this weekend trying it out.
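(For concreteness, here is a minimal CUDA sketch of that load-ahead idea, assuming it maps to shared-memory tiling: a block cooperatively stages a chunk of data, synchronizes, and then every thread examines the staged chunk for its own feature. The kernel name, `TILE` size, and the multiply-accumulate "examination" are all placeholders; it assumes a launch with `blockDim.x == TILE`.)

```cuda
#define TILE 256

__global__ void examine_features(const float *data, int n_data,
                                 float *scores, int n_features)
{
    __shared__ float tile[TILE];

    int feature = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int base = 0; base < n_data; base += TILE) {
        // Load-ahead step: all threads in the block cooperatively stage
        // the next chunk of data into fast shared memory.
        int i = base + threadIdx.x;
        tile[threadIdx.x] = (i < n_data) ? data[i] : 0.0f;
        __syncthreads();  // guarantee the whole tile is loaded

        // Every thread now examines the same staged data against its own
        // candidate feature (placeholder arithmetic stands in for the test).
        int limit = min(TILE, n_data - base);
        if (feature < n_features)
            for (int j = 0; j < limit; ++j)
                acc += tile[j] * (float)(feature + 1);
        __syncthreads();  // don't refill the tile while others still read it
    }

    if (feature < n_features)
        scores[feature] = acc;
}
```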
I don't really get what you're doing, but have you considered making one dimension of your work vary over the features? If you arrange that correctly, you only need to scan the memory once (all features read the first byte of memory, then all features read the next, and so on), roughly like the sketch below.
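(A minimal CUDA sketch of that layout, assuming one thread per candidate feature; the kernel name and the per-feature test are placeholders. Because every thread reads the same `data[j]` on each iteration, the hardware can serve the whole warp with a single broadcast, so the data is effectively scanned once.)

```cuda
__global__ void scan_once(const float *data, int n_data,
                          float *scores, int n_features)
{
    // One dimension of the work varies over feature: thread index = feature.
    int feature = blockIdx.x * blockDim.x + threadIdx.x;
    if (feature >= n_features) return;

    float acc = 0.0f;
    // The loop index walks the data; all active threads read the same
    // element in lockstep, so memory is traversed a single time.
    for (int j = 0; j < n_data; ++j)
        acc += data[j] * (float)(feature + 1);  // placeholder feature test

    scores[feature] = acc;
}
```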