Performance improvement on latency-bound parallel HPC applications by message sharing between processors
Item Usage Stats
The performance of paralellized High Performance Computing (HPC) applica-tions is tied to the eﬃciency of the underlying processor-to-processor commu-nication. In latency-bound applications, the performance runs into bottleneck by the processor that is sending the maximum number of messages to the other processors. To reduce the latency overhead, we propose a two-phase message-sharing-based algorithm, where the bottleneck processor (the processor sending the maximum number of messages) is paired with another processor. In the ﬁrst phase, the bottleneck processor is paired with the processor that has the maxi-mum number of common outgoing messages. In the second phase, the bottleneck processor is paired with the processor that has the minimum number of outgo-ing messages. In both phases, the processor pair share the common outgoing messages between them, reducing their total number of outgoing messages, but especially the number of outgoing messages of the bottleneck processor. We use Sparse Matrix-Vector Multiplication as the kernel application and a 512-processor setting for the experiments. The proposed message-sharing algorithm achieves a reduction of 84% in the number of messages sent by the bottleneck processor and a reduction of 60% in the total number of messages in the system.