So far the discussion has focused on latency, but the anomaly also affects throughput: normally you can run 2 independent shifts per cycle, each going to either p0 or p6, whereas with the anomaly this drops to a single shift per cycle.
Besides the shifts and rotations, BZHI and BEXTR are also affected. While they are not shifts per se, it makes sense that they would be implemented with the same circuitry (e.g., BZHI is dst = arg1 & ~(-1 << arg2)). BEXTR in particular goes from 2 to 6 cycles of latency!
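To make the "shift in disguise" point concrete, here is a minimal C sketch of what the two instructions compute for 32-bit operands. The function names bzhi32_ref and bextr32_ref are made up for illustration; this is only a software model of the documented semantics and says nothing about how the hardware actually implements them.

```c
#include <stdint.h>

/* Software model of 32-bit BZHI: clear all bits of src from position
 * `index` upward. Only the low 8 bits of index are used; if that value
 * is >= 32 the source is returned unchanged. */
static uint32_t bzhi32_ref(uint32_t src, uint32_t index)
{
    uint32_t n = index & 0xff;
    if (n >= 32)
        return src;                      /* avoid an undefined shift by >= 32 */
    return src & ~(UINT32_MAX << n);     /* dst = arg1 & ~(-1 << arg2) */
}

/* Software model of 32-bit BEXTR: ctrl[7:0] is the start bit and
 * ctrl[15:8] is the field length. Extracts `len` bits starting at
 * `start`, zero-extended. */
static uint32_t bextr32_ref(uint32_t src, uint32_t ctrl)
{
    uint32_t start = ctrl & 0xff;
    uint32_t len   = (ctrl >> 8) & 0xff;
    if (start >= 32 || len == 0)
        return 0;
    uint32_t shifted = src >> start;     /* first shift: drop the low bits */
    if (len >= 32)
        return shifted;
    return shifted & ~(UINT32_MAX << len); /* second shift: build the mask */
}
```

Both models boil down to a variable shift plus a mask, which fits the guess that these instructions share the shifter.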
Another thing I'm noticing is that the affected instructions are all p06 shift ops. You can alternatively implement a rotation using SHLD r,r,c, but that is a p1 operation, and I have not seen it slow down because of this issue.
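For anyone who wants to try the SHLD variant, here is a rough sketch using GCC/Clang inline assembly. The function names rotl7_shld and rotl7_portable and the fixed rotate count of 7 are arbitrary choices for illustration. On non-x86 targets or other compilers the first function just falls back to plain C, and I make no claim about which ports the compiler-generated code for the portable version ends up on.

```c
#include <stdint.h>

/* Plain C rotate-left by 7; compilers typically recognise this idiom
 * and emit a ROL, which is one of the p06 shift ops. */
static inline uint32_t rotl7_portable(uint32_t x)
{
    return (x << 7) | (x >> (32 - 7));
}

/* Rotate-left by 7 expressed as SHLD with the same register as both
 * operands: dst = (dst << 7) | (src >> 25), with src == dst. */
static inline uint32_t rotl7_shld(uint32_t x)
{
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
    /* AT&T syntax: shldl $count, %src, %dst */
    __asm__("shldl $7, %0, %0" : "+r"(x) : : "cc");
    return x;
#else
    return rotl7_portable(x);            /* fallback where SHLD is unavailable */
#endif
}
```

The inline asm is what keeps the compiler from turning the SHLD back into an ordinary ROL, so it matters if you want to compare the two in a benchmark.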