You are write on the time issue (but it does work if a and b are the same)
The reason becomes very apparent if you write out the assembly necessary to compile the xor, vs what is needed to compile a classic swap.
The xor would look something like:
load $1, a
load $2, b
xor $3, $1, $2
xor $4, $3, $2
xor $5, $4, $3
store b, $4
store a, $5
where as the classic swap would look something like this:
load $1, [a]
load $2, [b]
store b, $1
store a, $2
which is much better.
If you take the pipeline into account, then the difference between the swap and the xor can be huge because their is high level of dependence between the instructions.
On a theoretical classic 5-stage pipeline, the swap approach ends up needing about 10 cycles, where as the xor needs about 18.
In a real processor the difference would probably be much worse.
If you've hit memory, you've already lost. Running instructions will basically be line noise compared to that. Similarly, function call overhead will probably dwarf actually doing the xors.
However, with a good register allocator and inlined functions, your point becomes even stronger. The compiler can simply remove all instructions associated with the naieve swapping version, and simply record that before the swap %eax holds b, and after it, %eax now holds a. (Even the most naieve of code generators will do this sort of thing if the values in the swap don't fall outside of a basic block). The xors are far harder to optimize.
Assume you had a sorting algorithm that swaps values, and in some cases would swap two of the same values. It's easier (and often faster, thanks to the high cost of branch misprediction) to simply assume the swap of something with itself is a no-op than to explicitly check the address of the value to make sure it isn't the same. Or you could have just forgotten to make the check, and allowed the user to pass in their own swap functions (say, to deep-copy structs in C).
Your code should be robust enough to provide an intuitive default behavior for corner cases like that. That way the application code doesn't need to be cluttered with special conditions.