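People have been suggesting the time stamp counter, but that's actually not ideal for something this small, because reading it has a lot of overhead. On my desktop at work (a very beefy Intel Xeon), each read adds about 30 CPU cycles. And if you pair it with a serializing instruction to get a trustworthy reading, which you generally have to, it also drains the pipeline.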
For a microbenchmark like this, I find it's usually better to call the code under test in a loop 1,000,000 times, measure the total time, and divide by the iteration count. That often gives you a "best case" number, since e.g. the CPU doesn't need to decode the instructions on every iteration because they fit in the appropriate cache. But it avoids the overhead of the timestamp counter, because the cost of the two clock reads is spread over a million calls.
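Here's a minimal sketch of that loop approach in C, assuming a POSIX system with `clock_gettime`. The function `work` is just a hypothetical stand-in for whatever you're measuring, and `now_ns` and `bench_ns_per_call` are names I made up for illustration. The clock is read only twice, so its cost is amortized over all iterations.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical function under test; replace with the code you want to measure. */
static uint32_t work(uint32_t x)
{
    return x * 2654435761u + 1u;
}

/* Read the monotonic clock and return it in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/*
 * Call work() `iters` times back to back and return the average
 * wall-clock cost per call in nanoseconds. The final value is written
 * to *result_out so the caller can inspect it.
 */
static double bench_ns_per_call(uint64_t iters, uint32_t *result_out)
{
    assert(iters > 0);

    /* volatile keeps the compiler from optimizing the loop away.
     * Each call depends on the previous result, so this measures
     * latency (plus loop and store overhead), not throughput. */
    volatile uint32_t sink = 1;

    uint64_t start = now_ns();
    for (uint64_t i = 0; i < iters; i++) {
        sink = work(sink);
    }
    uint64_t end = now_ns();

    *result_out = sink;
    return (double)(end - start) / (double)iters;
}
```

You'd call it as `bench_ns_per_call(1000000, &out)` and print the result. Keep in mind the number includes loop overhead and reflects warm caches and a trained branch predictor, so treat it as a lower bound rather than what you'd see in a cold call.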