.NET 的多线程与多处理:糟糕的 Parallel.ForEach 性能
问题描述
我编写了一个非常简单的字数统计"程序,它读取文件并计算文件中每个单词的出现次数.以下是部分代码:
I have coded a very simple "Word Count" program that reads a file and counts each word's occurrence in the file. Here is a part of the code:
class Alaki
{
private static List<string> input = new List<string>();
private static void exec(int threadcount)
{
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = threadcount;
Parallel.ForEach(Partitioner.Create(0, input.Count),options, (range) =>
{
var dic = new Dictionary<string, List<int>>();
for (int i = range.Item1; i < range.Item2; i++)
{
//make some delay!
//for (int x = 0; x < 400000; x++) ;
var tokens = input[i].Split();
foreach (var token in tokens)
{
if (!dic.ContainsKey(token))
dic[token] = new List<int>();
dic[token].Add(1);
}
}
});
}
public static void Main(String[] args)
{
StreamReader reader=new StreamReader((@"c: xt-setagg.txt"));
while(true)
{
var line=reader.ReadLine();
if(line==null)
break;
input.Add(line);
}
DateTime t0 = DateTime.Now;
exec(Environment.ProcessorCount);
Console.WriteLine("Parallel: " + (DateTime.Now - t0));
t0 = DateTime.Now;
exec(1);
Console.WriteLine("Serial: " + (DateTime.Now - t0));
}
}
简单明了.我使用字典来计算每个单词的出现次数.样式大致基于 MapReduce 编程模型.如您所见,每个任务都使用自己的私有字典.所以,没有共享变量;只是一堆自己计算单词的任务.以下是代码在四核 i7 CPU 上运行时的输出:
It is simple and straight forward. I use a dictionary to count each word's occurrence. The style is roughly based on the MapReduce programming model. As you can see, each task is using its own private dictionary. So, there is NO shared variables; just a bunch of tasks that count words by themselves. Here is the output when the code is run on a quad-core i7 CPU:
并行:00:00:01.6220927
序列号:00:00:02.0471171
Parallel: 00:00:01.6220927
Serial: 00:00:02.0471171
加速大约是1.25,这意味着悲剧!但是当我在处理每一行时添加一些延迟时,我可以达到大约 4 的加速值.
The speedup is about 1.25 which means a tragedy! But when I add some delay when processing each line, I can reach speedup values about 4.
在没有延迟的原始并行执行中,CPU 的利用率几乎没有达到 30%,因此加速不乐观.但是,当我们添加一些延迟时,CPU 的利用率会达到 97%.
In the original parallel execution with no delay, CPU's utilization hardly reaches to 30% and therefore the speedup is not promising. But, when we add some delay, CPU's utilization reaches to 97%.
首先,我认为原因是程序的 IO 绑定性质(但我认为插入字典在某种程度上是 CPU 密集型的),这似乎是合乎逻辑的,因为所有线程都从共享内存总线读取数据.然而,令人惊讶的一点是,当我同时运行 4 个串行程序实例(没有延迟)时,CPU 的利用率达到了大约提升,并且所有四个实例都在大约 2.3 秒内完成!
Firstly, I thought the cause is the IO-bound nature of the program (but I think inserting into a dictionary is to some extent CPU intensive) and it seems logical because all of the threads are reading data from a shared memory bus. However, The surprising point is when I run 4 instances of serial programs (with no delays) simultaneously, CPU's utilization reaches to about raises and all of the four instances finish in about 2.3 seconds!
这意味着当代码在多处理配置中运行时,它会达到大约 3.5 的加速值,但在多线程配置中运行时,加速值约为 1.25.
This means that when the code is being run in a multiprocessing configuration, it reaches to a speedup value about 3.5 but when it is being run in multithreading config, the speedup is about 1.25.
你的想法是什么?我的代码有什么问题吗?因为我认为根本没有共享数据,而且我认为代码不会遇到任何争用..NET 的运行时是否存在缺陷?
What is your idea? Is there anything wrong about my code? Because I think there is no shared data at all and I think the code shall not experience any contentions. Is there a flaw in .NET's run-time?
提前致谢.
推荐答案
Parallel.For
不会将输入分成 n 块(其中 n 是 MaxDegreeOfParallelism
);相反,它会创建许多小批量,并确保最多 n 正在同时处理.(这样如果一个批次需要很长时间来处理,Parallel.For
仍然可以在其他线程上运行工作.请参阅 .NET 中的并行性 - 第 5 部分,工作分区了解更多详细信息.)
Parallel.For
doesn't divide the input into n pieces (where n is the MaxDegreeOfParallelism
); instead it creates many small batches and makes sure that at most n are being processed concurrently. (This is so that if one batch takes a very long time to process, Parallel.For
can still be running work on other threads. See Parallelism in .NET - Part 5, Partioning of Work for more details.)
由于这种设计,您的代码会创建并丢弃数十个 Dictionary 对象、数百个 List 对象和数千个 String 对象.这给垃圾收集器带来了巨大的压力.
Due to this design, your code is creating and throwing away dozens of Dictionary objects, hundreds of List objects, and thousands of String objects. This is putting enormous pressure on the garbage collector.
运行 PerfMonitor 报告说总运行时间的 43% 用于 GC.如果您重写代码以使用更少的临时对象,您应该会看到所需的 4 倍加速.以下是 PerfMonitor 报告的部分摘录:
Running PerfMonitor on my computer reports that 43% of the total run time is spent in GC. If you rewrite your code to use fewer temporary objects, you should see the desired 4x speedup. Some excerpts from the PerfMonitor report follow:
超过 10% 的总 CPU 时间花费在垃圾收集器上.大多数经过良好调整的应用都在 0-10% 范围内.这通常是由允许对象存活很长时间的分配模式引起足以需要昂贵的第 2 代系列.
Over 10% of the total CPU time was spent in the garbage collector. Most well tuned applications are in the 0-10% range. This is typically caused by an allocation pattern that allows objects to live just long enough to require an expensive Gen 2 collection.
该程序的 GC 堆分配峰值速率超过 10 MB/秒.这是相当高的.这并不少见,这只是一个性能错误.
This program had a peak GC heap allocation rate of over 10 MB/sec. This is quite high. It is not uncommon that this is simply a performance bug.
根据您的评论,我将尝试解释您报告的时间.在我的计算机上,使用 PerfMonitor,我测量了 43% 到 52% 的时间花在 GC 上.为简单起见,我们假设 50% 的 CPU 时间用于工作,50% 用于 GC.因此,如果我们使工作速度提高 4 倍(通过多线程),但保持 GC 的数量相同(这会发生,因为在并行和串行配置中处理的批次数量恰好相同),最好的我们可以获得的改进是原始时间的 62.5%,即 1.6 倍.
As per your comment, I will attempt to explain the timings you reported. On my computer, with PerfMonitor, I measured between 43% and 52% of time spent in GC. For simplicity, let's assume that 50% of the CPU time is work, and 50% is GC. Thus, if we make the work 4× faster (through multi-threading) but keep the amount of GC the same (this will happen because the number of batches being processed happened to be the same in the parallel and serial configurations), the best improvement we could get is 62.5% of the original time, or 1.6×.
但是,我们只看到 1.25 倍的加速,因为默认情况下 GC 不是多线程的(在工作站 GC 中).根据 垃圾回收基础,所有托管线程在生成期间暂停0 或 Gen 1 集合.(在 .NET 4 和 .NET 4.5 中,并发和后台 GC 可以在后台线程上收集第 2 代.)您的程序只经历了 1.25 倍的加速(并且您看到总体 CPU 使用率为 30%),因为线程花费了它们的大部分时间GC 暂停的时间(因为这个测试程序的内存分配模式很差).
However, we only see a 1.25× speedup because GC isn't multithreaded by default (in workstation GC). As per Fundamentals of Garbage Collection, all managed threads are paused during a Gen 0 or Gen 1 collection. (Concurrent and background GC, in .NET 4 and .NET 4.5, can collect Gen 2 on a background thread.) Your program experiences only a 1.25× speedup (and you see 30% CPU usage overall) because the threads spend most of their time being paused for GC (because the memory allocation pattern of this test program is very poor).
如果启用 服务器 GC,它将在多个线程.如果我这样做,程序运行速度会快 2 倍(几乎 100% 的 CPU 使用率).
If you enable server GC, it will perform garbage collection on multiple threads. If I do this, the program runs 2× faster (with almost 100% CPU usage).
当您同时运行程序的四个实例时,每个实例都有自己的托管堆,并且四个进程的垃圾收集可以并行执行.这就是为什么您会看到 100% 的 CPU 使用率(每个进程都在使用 100% 的一个 CPU).总时间稍长(全部为 2.3 秒,一个为 2.05 秒)可能是由于测量不准确、磁盘争用、加载文件所需的时间、必须初始化线程池、上下文切换的开销或其他原因环境因素.
When you run four instances of the program simultaneously, each has its own managed heap, and the garbage collection for the four processes can execute in parallel. This is why you see 100% CPU usage (each process is using 100% of one CPU). The slightly longer overall time (2.3s for all vs 2.05s for one) is possibly due to inaccuracies in measurement, contention for the disk, time taken to load the file, having to initialise the threadpool, overhead of context switching, or some other environment factor.
这篇关于.NET 的多线程与多处理:糟糕的 Parallel.ForEach 性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!