CSCI 338
Parallel Processing
Program 7: Rewriting Graph Algs with CUDA
Assigned: Friday, April 25, 2025
Final Due Dates:
Overview
In this assignment, you will revise one of the graph algorithms assigned to your group to use CUDA. If possible, please use the graph algorithm that you implemented with pthreads.
How the Honor Code Applies to This Assignment
This is a group-eligible assignment. That is, you should work with your class-approved team of students to produce your assignments, evaluation, and final reports. You are also encouraged to ask non-team members questions about clarification, language syntax, and error message interpretation, but you are not permitted to view or share each other's code or written design notes for part 1. Please see the syllabus for more details.
Rewriting Your Graph Algorithms
For this assignment, you will rewrite your graph algorithms to use CUDA and run on GPUs. This will involve transferring your graph data between the host and device, writing a CUDA kernel that performs the primary calculations in the algorithm, and launching that kernel. Setup tasks should continue to be performed by the host.
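A hedged sketch of that host/device workflow is below. All names here (`degree_kernel`, `row_offsets`, the CSR layout) are illustrative placeholders, not part of the GMS API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: one thread per vertex computes its out-degree
// from a CSR row-offset array.
__global__ void degree_kernel(const int *row_offsets, int *degrees, int n) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < n) {
        degrees[v] = row_offsets[v + 1] - row_offsets[v];
    }
}

int main() {
    const int n = 4;
    // Host-side CSR row offsets for a tiny example graph (setup stays on the host).
    int h_row_offsets[n + 1] = {0, 2, 3, 5, 6};
    int h_degrees[n];

    int *d_row_offsets, *d_degrees;
    cudaMalloc(&d_row_offsets, (n + 1) * sizeof(int));
    cudaMalloc(&d_degrees, n * sizeof(int));

    // Host -> device transfer of the graph data.
    cudaMemcpy(d_row_offsets, h_row_offsets, (n + 1) * sizeof(int),
               cudaMemcpyHostToDevice);

    // Launch: enough blocks of 256 threads to cover all vertices.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    degree_kernel<<<blocks, threads>>>(d_row_offsets, d_degrees, n);

    // Device -> host transfer of the results.
    cudaMemcpy(h_degrees, d_degrees, n * sizeof(int), cudaMemcpyDeviceToHost);

    for (int v = 0; v < n; v++)
        printf("vertex %d: degree %d\n", v, h_degrees[v]);

    cudaFree(d_row_offsets);
    cudaFree(d_degrees);
    return 0;
}
```

Your actual algorithm will do more in the kernel, but the transfer/launch/transfer shape stays the same.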
Enabling CUDA
I have placed a copy of the GMS code on the Azure VM file system. I have modified the top-level CMakeLists.txt file in the gms directory, as well as the makefile and code in the gms/examples/ subdirectory. The kernel.* files contain the device code, which is called from the triangle_counting.cpp file. Use this as an example for getting your code to compile.
Evaluating Your Graph Algorithms
You are not expected to come up with the absolutely most efficient CUDA implementation of your algorithm. However, you are expected to make smart design decisions. Your goal for this assignment is a reasonably well-functioning version of your graph algorithm running under CUDA. For example, try to have good coalescing of memory requests, but you do not have to use shared memory.
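As a hedged illustration of what coalescing means in practice (array names are placeholders, not anything from GMS):

```cuda
// Coalesced: consecutive threads in a warp read consecutive elements,
// so the warp's loads combine into a few wide memory transactions.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Uncoalesced: consecutive threads read elements `stride` apart, so the
// warp touches many separate cache lines and wastes memory bandwidth.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[(i * stride) % n];
}
```

When laying out your graph data on the device, prefer layouts where neighboring threads touch neighboring addresses, as in the first kernel.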
To evaluate the performance of your code, collect timing information for at least 4 different thread-organization configurations to see how different two- or three-dimensional block sizes impact performance. (For this assignment, at least two dimensions must have values greater than 1.) Make sure to use an input graph large enough that run times are about a minute or more on the GPU. Graph your results in your presentation slides and discuss your observations, including insights into why the different thread block sizes produce different performance. You should also compare your GPU results to results from the original OpenMP implementation that comes with GMS.
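One way to sweep block configurations and time each launch is with CUDA events. This is a sketch under assumptions: the kernel `work` and the four block shapes are illustrative stand-ins for your algorithm and your chosen configurations:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel over a 2-D index space; substitute your graph kernel.
__global__ void work(float *data, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h) data[y * w + x] += 1.0f;
}

int main() {
    const int w = 4096, h = 4096;
    float *d_data;
    cudaMalloc(&d_data, w * h * sizeof(float));

    // Four example 2-D block shapes; each has at least two dimensions > 1.
    dim3 shapes[4] = {dim3(8, 8), dim3(16, 16), dim3(32, 8), dim3(32, 16)};

    for (int s = 0; s < 4; s++) {
        dim3 block = shapes[s];
        // Round the grid up so every element is covered.
        dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        work<<<grid, block>>>(d_data, w, h);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block %ux%u: %.3f ms\n", block.x, block.y, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(d_data);
    return 0;
}
```

For run times near a minute, wrap the events around the full algorithm (or a loop of kernel launches) rather than a single launch.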
For this assignment, I am not requiring you to explore the use of shared memory or other memory-hierarchy optimizations (see above), but you may want to, since you will be asked to do so in the final project option using CUDA. Plus, it's kind of fun to play with these optimizations.
Evaluation
This assignment will be evaluated based on the design of your coding approach as well as your presentation of your design choices and performance results. When presenting your approach, you will want to explain why you made your choices regarding the organization of data and threads. You'll also want to explain any performance differences you observe across different thread organizations and platforms. As a reminder, you are not expected to have the most efficient CUDA version of this code; I just want to see that you were thoughtful in your approach.
Submitting Your Work
Please submit your code and your presentation slides (as a PDF) via GitLab.