Red-Black Gauss-Seidel and OpenMP
Problem description
I was trying to prove a point about OpenMP compared to MPICH, and I put together the following example to demonstrate how easy it is to get high performance with OpenMP.
The Gauss-Seidel iteration is split into two separate sweeps, such that within each sweep every update can be performed in any order and there is no dependency between tasks. So in theory no processor should ever have to wait for another one to synchronize.
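The independence claim can be checked on a small case. The following is a minimal serial sketch, not part of the original post: `sweepColour` and `redSweepIsOrderIndependent` are hypothetical helper names, and the input values are arbitrary. It applies the update formula from the question to the red points once in ascending and once in descending order and compares the results.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Update every point of one colour using the formula from the question.
// first = 1 selects the odd ("red") points, first = 2 the even ("black") ones.
// reverse = true visits the same points in the opposite order.
// (sweepColour is a hypothetical helper, not part of the original code.)
void sweepColour(std::vector<double> const& b, std::vector<double>& x,
                 double h, int first, bool reverse)
{
    int const N = static_cast<int>(x.size());
    double const invh2 = 1.0 / (h * h);
    double const h2 = h * h;

    // Collect the indices of this colour, optionally in reverse order.
    std::vector<int> order;
    for (int i = first; i < N - 1; i += 2) order.push_back(i);
    if (reverse) std::reverse(order.begin(), order.end());

    for (int i : order) {
        // Reads only the neighbours i-1 and i+1, which have the other colour
        // and are therefore not written during this sweep.
        double const sigma = -invh2 * (x[i - 1] + x[i + 1]);
        x[i] = (h2 / 2.0) * (b[i] - sigma);
    }
}

// Returns true if a forward and a reversed red sweep give identical results.
bool redSweepIsOrderIndependent(int N)
{
    std::vector<double> b(N), forward(N), backward;
    for (int i = 0; i < N; ++i) {
        b[i] = 1.0 + 0.25 * i;       // arbitrary right-hand side
        forward[i] = 0.5 * (i % 3);  // arbitrary initial guess
    }
    backward = forward;
    sweepColour(b, forward, 0.1, 1, false);
    sweepColour(b, backward, 0.1, 1, true);
    return forward == backward;
}
```

Because each red update reads only black neighbours, the two orders should agree exactly. That is what makes the sweep safe to parallelize, although it says nothing about how fast the parallel version will run.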
The problem I am encountering is that, independent of problem size, I see only a weak speed-up with 2 processors, and with more than 2 processors it may even be slower. For many other linear parallel routines I can obtain very good scaling, but this one is tricky.
My fear is that I am unable to "explain" to the compiler that the operations I perform on the array are thread-safe, so that it cannot parallelize them effectively.
See the example below.
Does anyone have a clue how to make this more efficient with OpenMP?
#include <vector>

void redBlackSmooth(std::vector<double> const & b,
                    std::vector<double> & x,
                    double h)
{
    // Setup relevant constants.
    double const invh2 = 1.0/(h*h);
    double const h2 = (h*h);
    int const N = static_cast<int>(x.size());
    double sigma = 0;

    // Setup some boundary conditions.
    x[0] = 0.0;
    x[N-1] = 0.0;

    // Red sweep: update the odd-indexed points.
    #pragma omp parallel for shared(b, x) private(sigma)
    for (int i = 1; i < N-1; i+=2)
    {
        sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }

    // Black sweep: update the even-indexed points.
    #pragma omp parallel for shared(b, x) private(sigma)
    for (int i = 2; i < N-1; i+=2)
    {
        sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }
}
Addition: I have now also tried a raw-pointer implementation, and it behaves the same as the STL container version, so it can be ruled out that this is some pseudo-critical behavior coming from the STL.
Recommended answer
First of all, make sure that the x vector is aligned to cache-line boundaries. I did some tests, and I get something like a 100% improvement with your code on my machine (Core Duo) if I force the alignment of the memory:
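#include <stdlib.h>  // posix_memalign (POSIX)
double * x;
const size_t CACHE_LINE_SIZE = 256;  // alignment in bytes
// posix_memalign returns nonzero on failure; release the buffer with free(x).
posix_memalign( reinterpret_cast<void**>(&x), CACHE_LINE_SIZE, sizeof(double) * N);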
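The allocation above can be wrapped so that the return value is checked and the alignment can be verified. This is only a sketch assuming a POSIX system; `allocAligned` and `isAligned` are hypothetical helper names that do not appear in the original answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX, not standard C++)

// Allocate n doubles whose start address is a multiple of `alignment` bytes.
// posix_memalign requires the alignment to be a power of two and a multiple
// of sizeof(void*). Returns nullptr if the allocation fails.
// (allocAligned is a hypothetical helper name.)
double* allocAligned(std::size_t n, std::size_t alignment)
{
    void* p = nullptr;
    if (posix_memalign(&p, alignment, n * sizeof(double)) != 0) {
        return nullptr;
    }
    return static_cast<double*>(p);
}

// True if the address p is a multiple of `alignment` bytes.
bool isAligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A buffer obtained this way is released with `free`, not `delete[]`. Going through a temporary `void*` also avoids the `reinterpret_cast<void**>` on a `double*` variable.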
Second, you can try to assign more computation to each thread (this way you can keep the cache lines that different threads write to separate), but I suspect that OpenMP already does something like this under the hood, so it may not be worth the effort for large N.
In my case this implementation is much faster when x is not cache-aligned.
// Assumes b, x, N, invh2 and h2 as in the question's function, plus a
// bool redSweep that selects the odd (red) or even (black) points.
const int workGroupSize = CACHE_LINE_SIZE / sizeof(double);
assert(N % workGroupSize == 0); // Need to tweak the code a bit to let it work with any N
const int workgroups = N / workGroupSize;
#pragma omp parallel for shared(b, x)
for (int j = 0; j < workgroups; j++) {
    // Each outer iteration handles one cache-line-sized block of x.
    const int base = j * workGroupSize;
    for (int k = 0; k < workGroupSize; k += 2)
    {
        const int i = base + k + (redSweep ? 1 : 0);
        if (i == 0 || i + 1 == N) continue; // skip the boundary points
        const double sigma = -invh2 * (x[i-1] + x[i+1]);
        x[i] = (h2/2.0) * (b[i] - sigma);
    }
}
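The restriction that N be a multiple of the block size can be lifted by rounding the block count up and clamping the last block. The following is one possible sketch of that tweak, not the answerer's code: `blockedSweep`, `plainSweep` and the `blockSize` parameter are illustrative names, and whether the blocking improves performance has not been measured here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One colour sweep split into fixed-size blocks, for any N.
// blockSize must be even so that every block starts on an even index and the
// red/black pattern is the same in each block.
// (blockedSweep and blockSize are illustrative names.)
void blockedSweep(std::vector<double> const& b, std::vector<double>& x,
                  double h, bool redSweep, int blockSize)
{
    assert(blockSize > 0 && blockSize % 2 == 0);
    int const N = static_cast<int>(x.size());
    double const invh2 = 1.0 / (h * h);
    double const h2 = h * h;
    int const blocks = (N + blockSize - 1) / blockSize;  // round up

    #pragma omp parallel for
    for (int j = 0; j < blocks; ++j) {
        int const base = j * blockSize;
        int const end = std::min(base + blockSize, N);   // last block may be short
        for (int i = base + (redSweep ? 1 : 0); i < end; i += 2) {
            if (i == 0 || i + 1 >= N) continue;          // skip boundary points
            double const sigma = -invh2 * (x[i - 1] + x[i + 1]);
            x[i] = (h2 / 2.0) * (b[i] - sigma);
        }
    }
}

// Unblocked serial sweep with the same formula, used only as a reference.
void plainSweep(std::vector<double> const& b, std::vector<double>& x,
                double h, bool redSweep)
{
    int const N = static_cast<int>(x.size());
    double const invh2 = 1.0 / (h * h);
    double const h2 = h * h;
    for (int i = (redSweep ? 1 : 2); i < N - 1; i += 2) {
        double const sigma = -invh2 * (x[i - 1] + x[i + 1]);
        x[i] = (h2 / 2.0) * (b[i] - sigma);
    }
}
```

Calling `blockedSweep` once with `redSweep = true` and once with `redSweep = false` should reproduce the two loops of the question's function. With `blockSize = CACHE_LINE_SIZE / sizeof(double)` it corresponds to the work-group size used in the answer.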
In conclusion, you definitely have a cache-fighting problem (threads contending for the same cache lines), but given the way OpenMP works (sadly I am not very familiar with it), it should be enough to work with properly allocated, aligned buffers.