I'm running a simple kernel which adds two streams of double-precision complex-values. I've parallelized it using OpenMP with custom scheduling: the slice_indic
I'm attempting to benchmark the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168: 24 cores @ 2.70 GHz, L1 cache 32 kB, L2 cache 1 MB a