Here's one better, provided each queue entry is strictly smaller than a cache line (usually 64 bytes), which leaves room for a flag bit beside it, and your processor guarantees memory ordering within a cache line (most do):
1) Keep head and tail pointers local to the consumer and producer.
2) Associate a bit with each entry in the queue which denotes whether the entry is full or not. The bit must live within the same cache line as the entry itself.
3) Block on this bit, rather than head & tail pointers.
4) Set an entry's bit after filling the entry; clear it before.
5) You can now elide the memory fence (implicit in the .lazySet() method of the atomic objects). Performance will skyrocket.
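
A rough sketch of the steps above in Java, for one producer thread and one consumer thread. The class name `SlotFlagQueue` and its methods are made up for illustration. Two assumptions to flag: the "full" bit is represented by the slot holding a non-null reference, so the flag and the entry are the same word, and the stores still go through `lazySet()`, because plain Java writes give no portable ordering guarantee and the cache-line argument can't be expressed directly in the language.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

/**
 * Single-producer/single-consumer ring buffer sketch (hypothetical class).
 * A slot counts as "full" when it holds a non-null reference.
 */
final class SlotFlagQueue<T> {
    private final AtomicReferenceArray<T> slots;
    private final int mask;
    // Plain fields: tail is touched only by the producer thread,
    // head only by the consumer thread, so neither is shared.
    private long tail;
    private long head;

    SlotFlagQueue(int capacityPowerOfTwo) {
        if (capacityPowerOfTwo <= 0 || Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new AtomicReferenceArray<>(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
    }

    /** Producer thread only. Returns false if the next slot is still full. */
    boolean offer(T value) {
        if (value == null) {
            throw new NullPointerException("null marks an empty slot");
        }
        int i = (int) (tail & mask);
        if (slots.get(i) != null) {
            return false;              // look at the slot's flag, not at head
        }
        slots.lazySet(i, value);       // ordered store, no full fence
        tail++;
        return true;
    }

    /** Consumer thread only. Returns null if the next slot is still empty. */
    T poll() {
        int i = (int) (head & mask);
        T value = slots.get(i);        // look at the slot's flag, not at tail
        if (value == null) {
            return null;
        }
        slots.lazySet(i, null);        // mark the slot empty again
        head++;
        return value;
    }
}
```

A caller that wants to block would spin on `offer()` or `poll()` until it succeeds. Note that `head` and `tail` sit in the same object here, so a tuned version would likely pad them apart to avoid false sharing.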