Lock-Free Priority Queue Based on Multi-Dimensional Linked Lists

Lock-Free Priority Queue Based on Multi-Dimensional Linked Lists

Authors: Yumin Chen and Jun Tao Luo

Links:

Summary

We are going to investigate the performance of a concurrent lock-free priority queue based on multi-dimensional linked lists, which guarantees the worst-case sequential search time of O(logN).

Background

Scalable concurrent priority queues, which are pivotal to topics such as the realization of parallelizing search algorithms, priority task scheduling and discrete event simulation, has been a research topic for many years. INSERT and DELELETEMIN are two canonical operations of a priority queue. In sequential execution scenarios, priority queues can be implemented on top of balanced search trees or array-based binary heaps. However, these two implementations both encounter bottlenecks in concurrent situations for maintaining a consistent global data structure. Recently, researchers attempted to solve this problem by applying skiplists and attained a great breakthrough. The skiplist solution organized several linked lists into different levels to eliminate the need of rebalancing and at the same time separate the memory use within different parts of the structure, which allows concurrent access to the data structure. However, the solution is still deficient in some regards. Both of the operations are restricted by the dependency of data within the structure and limits the overall throughput.

As a performance baseline, we will implement a fine-grained priority queue[2]. Inspired by the ideas of [1], we will investigate the performance of a concurrent lock-free priority queue based on MDList that guarantees a worst-case search time of O(logN). Based on MDList, we are going to implement the concurrent insertion with two steps: node splicing and child adoption[Figure 1], which only updates at most two consecutive nodes.

Node splicing and child adoption Figure 1: Insert operation in a 3DList

For the deleteMin operation, we will apply both logical and physical deletion while maintaining a deletion stack to provide the operation the information about the position of the next smallest node to reduce node traversal [Figure 2].

deleteMin operation Figure 2: Logical and Physical Deletion

Finally, we will analyze the performance of the MDList concurrent lock-free priority queue given variations in parameters such as the number of threads and the number of dimensions of the linked list using various traffic loads. Finally we will benchmark the performance against some well-established concurrent lock-free priority queue implementation, if applicable, such as TBBPQ by intel and LJPQ by Herlihy and Shavit[3] on NUMA systems.

The Challenge

Resources

For the code base, we will build the data structure from scratch with the help of pseudo code provided in [1]. For parallel machine resources, we plan to gather performance data on the PSC Bridges-2 RM.

Goals and Deliverables

Platform Choice

We will implement the priority_queue in C++ and develop the benchmark scripts in Python. Experiments will be conducted on PSC machines because it can easily scale from 1 core to 256 cores allowing for more comprehensive study into the performance of the data structure when scaling to larger systems. We also chose this platform due to our familiarity and the reliability of performance statistics since we can guarantee exclusive access to a node for benchmarking.

Schedule

Week Task Assignee
Nov. 7 - Nov. 13 Finish project proposal and study related research paper thoroughly Completed, Jun Tao Luo and Yumin Chen
Nov. 14 - Nov. 20 Build Data Structures, including Node, Stack, AdoptDesc and PriorityQueue Completed, Yumin Chen
Nov. 14 - Nov. 20 Implement the coarse-grained concurrent priority queue Completed, Jun Tao Luo
Nov. 21 - Nov. 27 Implement concurrent DELETEMIN, including Logical deletion and batch physical deletion Completed, Yumin Chen
Nov. 21 - Nov. 27 Implement benchmark scripts and reference algorithm (parallel Dijkstra’s) using concurrent priority queues. Completed, Jun Tao Luo
Nov. 21 - Nov. 27 Start to work on project milestone report Completed, Jun Tao Luo and Yumin Chen
Nov. 30 Finish the project milestone report Completed, Jun Tao Luo and Yumin Chen
Dec. 2 Implement concurrent INSERT Completed, Jun Tao Luo
Dec. 2 Test and debug correctness of the MDList priority queue Completed, Yumin Chen
Dec. 6 Profile the MDList priority queue, and analyze performance Completed, Yumin Chen
Dec. 6 Create graphs showing performance statistics and make comparisons with other mature lock-free concurrent priority-queue if applicable Completed, Jun Tao Luo
Dec. 9 Final project report draft Completed, Jun Tao Luo
Dec. 9 Final project poster draft Completed, Yumin Chen
Dec. 9 Evaluate feasibility of HOPE TO ACHIEVE tasks Completed, Jun Tao Luo and Yumin Chen
Dec. 13 Complete final project report Completed, Jun Tao Luo and Yumin Chen
Dec. 13 Complete HOPE TO ACHIEVE tasks Completed, Jun Tao Luo and Yumin Chen
Dec. 17 Complete final project poster Completed, Jun Tao Luo and Yumin Chen

Current Progress

We changed the order of some of the tasks outlined in our original proposal, but the planned items are mostly unchanged.

In addition to using only microbenchmarks as a method to evaluate our data structure in our proposal, we decided to add a reference algorithm as a more realistic benchmark. For this purpose we have implemented a parallelized Dijkstra’s algorithm that uses a concurrent priority queue [4]. This has a further benefit of allowing us to empirically confirm the correctness of the concurrent priority queue by comparing the results of the parallelized algorithm with the sequential algorithm.

We have completed our testing harness which includes scripts to generate input graphs for parallelized Dijkstra’s algorithm, test the correctness of the output and compute the performance of the parallelized algorithm using different implementations of concurrent priority queues against the sequential algorithm and each other. We have put off writing specific microbenchmarks since we believe it will be simple but potentially dependent on the exact APIs of each of our priority queue implementations so we’ll work on this once we complete our priority queue implementations.

For the implementation of MDList based lock free concurrent priority queue, we have implemented the data structures it uses, the DeleteMin algorithm and are currently working on Insert. We are also in the process of debugging our current implementation and have not yet tested the functionality end to end using our parallelized Dijkstra’s algorithm.

Preliminary Results

We measured the speed up for coarse-grained concurrent priority queue compared to the sequential implementation of priority queue of std::priority_queue for the Dijkstra’s benchmark:

 -- Performance Table ---
 Scene Name      | 4               | 8
-----------------------------------------------------
 bench-64        | 0.000247        | 0.000989
 bench-256       | 0.0003          | 0.000503
 bench-1024      | 0.002944        | 0.00348
 bench-4096      | 0.032007        | 0.032112
 bench-8192      | 0.12489         | 0.11085

-- Speedup Table ---
 Scene Name      | 4               | 8
-----------------------------------------------------
 bench-64        | 0.068826        | 0.017189
 bench-256       | 0.946667        | 0.564612
 bench-1024      | 1.756454        | 1.485920
 bench-4096      | 3.089168        | 3.079067
 bench-8192      | 3.310625        | 3.729941

References

[1] Zhang, D., & Dechev, D. (2016). A lock-free priority queue design based on multi-dimensional linked lists. IEEE Transactions on Parallel and Distributed Systems, 27(3), 613–626. https://doi.org/10.1109/tpds.2015.2419651 [2] Hunt, G. C., Michael, M. M., Parthasarathy, S., & Scott, M. L. (1996). An efficient algorithm for concurrent priority queue heaps. Information Processing Letters, 60(3), 151–157. https://doi.org/10.1016/s0020-0190(96)00148-2 [3] M. Herlihy and N. Shavit, The Art of Multiprocessor Programming, Revised Reprint. Amsterdam, The Netherlands: Elsevier, 2012. [4] Tamir, O., Morrison, A., & Rinetzky, N. (2016). A heap-based concurrent priority queue with mutable priorities for faster parallel algorithms. In 19th International Conference on Principles of Distributed Systems (OPODIS 2015).