Authors: Yumin Chen and Jun Tao Luo
Links:
We are going to investigate the performance of a concurrent lock-free priority queue based on multi-dimensional linked lists, which guarantees the worst-case sequential search time of O(logN).
Scalable concurrent priority queues, which are pivotal to topics such as the realization of parallelizing search algorithms, priority task scheduling and discrete event simulation, has been a research topic for many years. INSERT and DELELETEMIN are two canonical operations of a priority queue. In sequential execution scenarios, priority queues can be implemented on top of balanced search trees or array-based binary heaps. However, these two implementations both encounter bottlenecks in concurrent situations for maintaining a consistent global data structure. Recently, researchers attempted to solve this problem by applying skiplists and attained a great breakthrough. The skiplist solution organized several linked lists into different levels to eliminate the need of rebalancing and at the same time separate the memory use within different parts of the structure, which allows concurrent access to the data structure. However, the solution is still deficient in some regards. Both of the operations are restricted by the dependency of data within the structure and limits the overall throughput.
As a performance baseline, we will implement a fine-grained priority queue[2]. Inspired by the ideas of [1], we will investigate the performance of a concurrent lock-free priority queue based on MDList that guarantees a worst-case search time of O(logN). Based on MDList, we are going to implement the concurrent insertion with two steps: node splicing and child adoption[Figure 1], which only updates at most two consecutive nodes.
Figure 1: Insert operation in a 3DList
For the deleteMin operation, we will apply both logical and physical deletion while maintaining a deletion stack to provide the operation the information about the position of the next smallest node to reduce node traversal [Figure 2].
Figure 2: Logical and Physical Deletion
Finally, we will analyze the performance of the MDList concurrent lock-free priority queue given variations in parameters such as the number of threads and the number of dimensions of the linked list using various traffic loads. Finally we will benchmark the performance against some well-established concurrent lock-free priority queue implementation, if applicable, such as TBBPQ by intel and LJPQ by Herlihy and Shavit[3] on NUMA systems.
For the code base, we will build the data structure from scratch with the help of pseudo code provided in [1]. For parallel machine resources, we plan to gather performance data on the PSC Bridges-2 RM.
We will implement the priority_queue in C++ and develop the benchmark scripts in Python. Experiments will be conducted on PSC machines because it can easily scale from 1 core to 256 cores allowing for more comprehensive study into the performance of the data structure when scaling to larger systems. We also chose this platform due to our familiarity and the reliability of performance statistics since we can guarantee exclusive access to a node for benchmarking.
Week | Task | Assignee |
---|---|---|
Nov. 7 - Nov. 13 | Finish project proposal and study related research paper thoroughly | Completed, Jun Tao Luo and Yumin Chen |
Nov. 14 - Nov. 20 | Build Data Structures, including Node, Stack, AdoptDesc and PriorityQueue | Completed, Yumin Chen |
Nov. 14 - Nov. 20 | Implement the coarse-grained concurrent priority queue | Completed, Jun Tao Luo |
Nov. 21 - Nov. 27 | Implement concurrent DELETEMIN, including Logical deletion and batch physical deletion | Completed, Yumin Chen |
Nov. 21 - Nov. 27 | Implement benchmark scripts and reference algorithm (parallel Dijkstra’s) using concurrent priority queues. | Completed, Jun Tao Luo |
Nov. 21 - Nov. 27 | Start to work on project milestone report | Completed, Jun Tao Luo and Yumin Chen |
Nov. 30 | Finish the project milestone report | Completed, Jun Tao Luo and Yumin Chen |
Dec. 2 | Implement concurrent INSERT | Completed, Jun Tao Luo |
Dec. 2 | Test and debug correctness of the MDList priority queue | Completed, Yumin Chen |
Dec. 6 | Profile the MDList priority queue, and analyze performance | Completed, Yumin Chen |
Dec. 6 | Create graphs showing performance statistics and make comparisons with other mature lock-free concurrent priority-queue if applicable | Completed, Jun Tao Luo |
Dec. 9 | Final project report draft | Completed, Jun Tao Luo |
Dec. 9 | Final project poster draft | Completed, Yumin Chen |
Dec. 9 | Evaluate feasibility of HOPE TO ACHIEVE tasks | Completed, Jun Tao Luo and Yumin Chen |
Dec. 13 | Complete final project report | Completed, Jun Tao Luo and Yumin Chen |
Dec. 13 | Complete HOPE TO ACHIEVE tasks | Completed, Jun Tao Luo and Yumin Chen |
Dec. 17 | Complete final project poster | Completed, Jun Tao Luo and Yumin Chen |
We changed the order of some of the tasks outlined in our original proposal, but the planned items are mostly unchanged.
In addition to using only microbenchmarks as a method to evaluate our data structure in our proposal, we decided to add a reference algorithm as a more realistic benchmark. For this purpose we have implemented a parallelized Dijkstra’s algorithm that uses a concurrent priority queue [4]. This has a further benefit of allowing us to empirically confirm the correctness of the concurrent priority queue by comparing the results of the parallelized algorithm with the sequential algorithm.
We have completed our testing harness which includes scripts to generate input graphs for parallelized Dijkstra’s algorithm, test the correctness of the output and compute the performance of the parallelized algorithm using different implementations of concurrent priority queues against the sequential algorithm and each other. We have put off writing specific microbenchmarks since we believe it will be simple but potentially dependent on the exact APIs of each of our priority queue implementations so we’ll work on this once we complete our priority queue implementations.
For the implementation of MDList based lock free concurrent priority queue, we have implemented the data structures it uses, the DeleteMin algorithm and are currently working on Insert. We are also in the process of debugging our current implementation and have not yet tested the functionality end to end using our parallelized Dijkstra’s algorithm.
We measured the speed up for coarse-grained concurrent priority queue compared to the sequential implementation of priority queue of std::priority_queue
for the Dijkstra’s benchmark:
-- Performance Table ---
Scene Name | 4 | 8
-----------------------------------------------------
bench-64 | 0.000247 | 0.000989
bench-256 | 0.0003 | 0.000503
bench-1024 | 0.002944 | 0.00348
bench-4096 | 0.032007 | 0.032112
bench-8192 | 0.12489 | 0.11085
-- Speedup Table ---
Scene Name | 4 | 8
-----------------------------------------------------
bench-64 | 0.068826 | 0.017189
bench-256 | 0.946667 | 0.564612
bench-1024 | 1.756454 | 1.485920
bench-4096 | 3.089168 | 3.079067
bench-8192 | 3.310625 | 3.729941
[1] Zhang, D., & Dechev, D. (2016). A lock-free priority queue design based on multi-dimensional linked lists. IEEE Transactions on Parallel and Distributed Systems, 27(3), 613–626. https://doi.org/10.1109/tpds.2015.2419651 [2] Hunt, G. C., Michael, M. M., Parthasarathy, S., & Scott, M. L. (1996). An efficient algorithm for concurrent priority queue heaps. Information Processing Letters, 60(3), 151–157. https://doi.org/10.1016/s0020-0190(96)00148-2 [3] M. Herlihy and N. Shavit, The Art of Multiprocessor Programming, Revised Reprint. Amsterdam, The Netherlands: Elsevier, 2012. [4] Tamir, O., Morrison, A., & Rinetzky, N. (2016). A heap-based concurrent priority queue with mutable priorities for faster parallel algorithms. In 19th International Conference on Principles of Distributed Systems (OPODIS 2015).