CSCI 338

Parallel Processing


Program 7: Rewriting Graph Algs with CUDA

Assigned Friday, April 25, 2025
Final Due Dates
  • Thursday, May 1: Kernel with data transferred to device
  • Tuesday, May 6: Presentation and code

Overview

In this assignment, you will revise one of the graph algorithms assigned to your group to use CUDA. If possible, please use the graph algorithm that you implemented with pthreads.

How the Honor Code Applies to This Assignment

This is a group-eligible assignment, meaning you should work with your class-approved team of students to produce your assignments, evaluation, and final reports. You are also encouraged to ask non-team members questions about clarification, language syntax, and error-message interpretation, but you are not permitted to view or share each other's code or written design notes for Part 1. Please see the syllabus for more details.

Rewriting Your Graph Algorithms

For this assignment, you will rewrite your graph algorithms to use CUDA and run on GPUs. This will involve transferring your graph data between the host and device, writing a CUDA kernel that performs the primary calculations in the algorithm, and launching that kernel. Setup tasks should continue to be performed by the host.
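The host/device split described above can be sketched as follows. This is a minimal, hypothetical example, not the GMS code: the CSR arrays (row_ptr) and the per-vertex degree computation are stand-ins for your algorithm's actual graph data and kernel.

```cuda
// Hypothetical sketch: the CSR layout and the kernel body are placeholders
// for your algorithm's real data structures and computation.
#include <cstdio>
#include <cuda_runtime.h>

// One thread per vertex: compute each vertex's degree from the CSR offsets.
__global__ void degree_kernel(const int *row_ptr, int *degree, int num_vertices) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < num_vertices)
        degree[v] = row_ptr[v + 1] - row_ptr[v];
}

int main() {
    const int n = 4;
    int h_row_ptr[n + 1] = {0, 2, 3, 5, 6};   // setup stays on the host
    int h_degree[n];

    int *d_row_ptr, *d_degree;
    cudaMalloc(&d_row_ptr, (n + 1) * sizeof(int));
    cudaMalloc(&d_degree, n * sizeof(int));

    // Host -> device transfer of the graph data.
    cudaMemcpy(d_row_ptr, h_row_ptr, (n + 1) * sizeof(int),
               cudaMemcpyHostToDevice);

    // Launch: enough 256-thread blocks to cover all vertices.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    degree_kernel<<<blocks, threads>>>(d_row_ptr, d_degree, n);

    // Device -> host transfer of the results.
    cudaMemcpy(h_degree, d_degree, n * sizeof(int), cudaMemcpyDeviceToHost);

    for (int v = 0; v < n; v++)
        printf("degree[%d] = %d\n", v, h_degree[v]);

    cudaFree(d_row_ptr);
    cudaFree(d_degree);
    return 0;
}
```

Your kernel will do far more work than this, but the shape is the same: allocate, copy in, launch, copy out, free.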

Enabling CUDA

I have placed a copy of the GMS code on the Azure VM file system. I have modified the top level CMakeLists.txt file in the gms directory. I have also modified the makefile and code in the gms/examples/ subdirectory. The kernel.* files contain the device code, which is called from the triangle_counting.cpp file. Use this as an example for getting your code to compile.
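The exact edits are already in the modified files on the VM, but for orientation, the general pattern with CMake's built-in CUDA support looks roughly like this. The target and file names below are placeholders, not the actual GMS targets:

```cmake
# Hypothetical sketch -- consult the modified files in gms/ for the real setup.
enable_language(CUDA)                 # in the top-level CMakeLists.txt

add_executable(triangle_counting
    triangle_counting.cpp             # host code that calls into the kernel
    kernel.cu)                        # device code (the kernel.* files)
set_target_properties(triangle_counting PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON)
```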

Evaluating Your Graph Algorithms

You are not expected to come up with the absolutely most efficient CUDA implementation of your algorithm. However, you are expected to make smart design decisions. For this assignment, your goal is to create a reasonably well-functioning version of your graph algorithm that runs using CUDA. For example, try to achieve good coalescing of memory requests, but you do not have to use shared memory.
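To make the coalescing point concrete, here is a hedged illustration (the array names are placeholders): consecutive threads in a warp should touch consecutive addresses so the hardware can merge their requests into a few wide memory transactions.

```cuda
// Coalesced: thread i reads element i, so a warp's 32 loads fall in
// adjacent addresses and combine into few memory transactions.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: neighboring threads read addresses `stride` elements apart,
// so the same amount of data costs many separate transactions.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(long long)i * stride % n];
}
```

For graph algorithms, this often comes down to how you lay out adjacency data and which thread reads which neighbor list entry.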

To evaluate the performance of your code, collect timing information for at least 4 different thread organizations to see how different two- or three-dimensional block sizes impact performance. (For this assignment, at least two dimensions must have values greater than 1.) Make sure to use an input graph that is large enough that run times are about a minute or more on the GPU. Graph your results in your presentation slides and discuss your observations, along with insights into why performance differs across the thread block sizes. You should also compare your GPU results to results from the original OpenMP implementation that comes with GMS.
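One common way to collect these timings is with CUDA events, which measure elapsed GPU time around a kernel launch. The sketch below is hypothetical: my_kernel and its arguments stand in for your algorithm's kernel, and the four block shapes are just example configurations (each with at least two dimensions greater than 1).

```cuda
// Hedged sketch of per-configuration timing with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel over a 2D index space; substitute your algorithm's kernel.
__global__ void my_kernel(float *data, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny)
        data[y * nx + x] += 1.0f;     // stand-in for real work
}

int main() {
    const int nx = 4096, ny = 4096;
    float *d_data;
    cudaMalloc(&d_data, (size_t)nx * ny * sizeof(float));

    // Four example block shapes, all 256 threads but different geometry.
    dim3 configs[4] = {dim3(32, 8), dim3(16, 16), dim3(8, 32), dim3(64, 4)};

    for (int c = 0; c < 4; c++) {
        dim3 block = configs[c];
        dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        my_kernel<<<grid, block>>>(d_data, nx, ny);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);   // wait for the kernel to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block (%u,%u): %.3f ms\n", block.x, block.y, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(d_data);
    return 0;
}
```

The numbers this prints are what you would tabulate and graph for your presentation, one row per block configuration.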

For this assignment, I am not requiring you to explore the use of shared memory or other memory-hierarchy optimizations (see above), but you may want to, since you will be asked to do so in the final project option that uses CUDA. Plus, it's kind of fun to play with these optimizations.

Evaluation

This assignment will be evaluated based on the design of your coding approach as well as your presentation of your design choices and performance results. When presenting your approach, you will want to explain why you made your choices regarding the organization of data and threads. You'll also want to explain any performance differences you observe across different thread organizations and platforms. As a reminder, you are not expected to have the most efficient CUDA version of this code; I just want to see that you were thoughtful in your approach.

Submitting Your Work

Please submit your code and your presentation slides (as a PDF) via GitLab.