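People have been suggesting the time stamp counter, but that's actually not ideal for measuring something this small, because reading it has a lot of overhead. On my desktop at work (a very beefy Intel Xeon) each read adds about 30 CPU cycles. When it is paired with a serializing instruction, as careful timing requires, it also drains the pipeline.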

For a microbenchmark like this, I find it's usually better to call the code under test in a loop 1,000,000 times, measure the total time, and divide by the iteration count. That often gives you a "best case" number, because e.g. the CPU doesn't need to decode the instructions on every iteration when they fit in the appropriate cache. But it spreads the cost of the timing calls over so many iterations that their overhead no longer matters.
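
Here's a rough sketch of that loop in C, assuming a POSIX system with `clock_gettime`. The `work` function is a made-up stand-in for whatever you actually want to measure, and `bench_ns_per_call` is just a name I picked. Each call feeds its result into the next one and the final value goes into a `volatile`, so the compiler is less likely to throw the loop away.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical function under test; replace with the real one. */
static uint32_t work(uint32_t x) {
    return x * 2654435761u + 1u;
}

/* Average wall-clock nanoseconds per call over `iters` calls.
 * The clock is read only twice, so its cost is amortized
 * across the whole loop instead of being paid per call. */
static double bench_ns_per_call(long iters) {
    struct timespec t0, t1;
    uint32_t acc = 1;
    volatile uint32_t sink; /* keeps the final result observable */

    assert(iters > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        acc = work(acc); /* each call depends on the previous result */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = acc;
    (void)sink;

    double total_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                    + (double)(t1.tv_nsec - t0.tv_nsec);
    return total_ns / (double)iters;
}
```

Call `bench_ns_per_call(1000000)` from `main` and print the result. Run it a few times and look at the spread; the number is an average over a warm loop, so treat it as the best-case figure described above rather than the cost of a single cold call.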