CSCI 338

Parallel Processing


Program 6: CUDA Programming

Assigned Thursday, April 17, 2025
Final Due Date Thursday, April 24, 2025

Overview

The purpose of this assignment is to become familiar with converting parallel code initially written for a CPU-based system to use GPUs. To do this, you'll be reimplementing your blurring filter to use CUDA and run on a GPU.

How the Honor Code Applies to This Assignment

This is a group-eligible assignment, meaning you should work with your class-approved team of students to produce your assignments, evaluation, and final reports. You are also encouraged to ask non-team members questions about clarification, language syntax, and error-message interpretation, but you are not permitted to view or share each other's code or written design notes for part 1. Please see the syllabus for more details.

Getting Started

When converting code to use a new parallel programming system, there are several steps. The first is to create a correctly functioning version of the parallel code on the new system. The second is to optimize your code in a variety of ways, taking advantage of key characteristics of the system, to improve its performance. Not every optimization you try will improve performance; expect to explore different ideas and measure which ones actually help. In this assignment, you'll look at how your configuration of threads impacts the performance of your code on GPUs.

You'll be working in the cuda-samples/Samples/0_Introduction/ subdirectory, where you'll find a directory named cs338_. In it, you'll find starter code in a file called cs338Blur.cu that reads in and writes out JPEG files. A sample CUDA kernel framework is already in the code. You should be able to type make in the cuda-samples/build directory and then navigate to the corresponding subdirectory of build to run your code.

As you write your code, make sure to do error checking on all CUDA functionality, using error-checking helpers such as checkCudaErrors.
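As a minimal sketch, error checking might look like the following. checkCudaErrors comes from the samples' helper_cuda.h and aborts with a message if the wrapped call fails; kernel launches need a separate check afterward since the launch itself returns no status. The buffer, size, and kernel names here are illustrative, not the starter code's exact identifiers.

```cuda
unsigned char *d_in;
checkCudaErrors(cudaMalloc(&d_in, numBytes));
checkCudaErrors(cudaMemcpy(d_in, h_in, numBytes, cudaMemcpyHostToDevice));

blurKernel<<<grid, block>>>(d_in, d_out, width, height);
checkCudaErrors(cudaGetLastError());      // catches launch-configuration errors
checkCudaErrors(cudaDeviceSynchronize()); // catches errors raised while the kernel runs
```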

To collect execution time numbers for the kernels, use CUDA Events. For the purposes of these tests, do not include memory transfer times in your execution time numbers. Also, remember to time multiple runs of the kernel and average the results to smooth out outliers.
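One possible shape for that timing loop, using CUDA events to bracket only the kernel (transfers excluded) and averaging over several runs; NUM_RUNS and the kernel arguments are illustrative:

```cuda
cudaEvent_t start, stop;
checkCudaErrors(cudaEventCreate(&start));
checkCudaErrors(cudaEventCreate(&stop));

float totalMs = 0.0f;
for (int i = 0; i < NUM_RUNS; ++i) {
    checkCudaErrors(cudaEventRecord(start));
    blurKernel<<<grid, block>>>(d_in, d_out, width, height);
    checkCudaErrors(cudaEventRecord(stop));
    checkCudaErrors(cudaEventSynchronize(stop)); // wait until the kernel finishes
    float ms;
    checkCudaErrors(cudaEventElapsedTime(&ms, start, stop)); // elapsed time in ms
    totalMs += ms;
}
printf("average kernel time: %f ms\n", totalMs / NUM_RUNS);

checkCudaErrors(cudaEventDestroy(start));
checkCudaErrors(cudaEventDestroy(stop));
```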

Writing correct but naive code

For this assignment, you'll be implementing the naive blurring algorithm from program 3 as a CUDA kernel. Blocks and grids should both be two-dimensional. Each thread should perform calculations for a single output pixel.
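A sketch of what the per-pixel structure might look like; the kernel body, parameter names, and the 16x16 block size are illustrative only (choosing block sizes is part of the experiment below):

```cuda
__global__ void blurKernel(const unsigned char *in, unsigned char *out,
                           int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column of this thread's pixel
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row of this thread's pixel
    if (x >= width || y >= height) return;          // guard threads past the image edge
    // ... compute the blurred value for pixel (x, y) and write out[y * width + x] ...
}

// Launch configuration: round the grid up so every pixel gets a thread.
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
blurKernel<<<grid, block>>>(d_in, d_out, width, height);
```

Note the bounds check: because the grid is rounded up, the last row and column of blocks may contain threads with no pixel to process.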

One of the issues you will need to deal with in this section is transforming the two-dimensional array of pixel values into a flattened one-dimensional array that gets passed to the CUDA kernel. Similarly, after the CUDA kernel finishes, you'll need to convert that one-dimensional array back into a two-dimensional array for writing to the output file.
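The key idea is the row-major index mapping: element (row, col) lives at flat index row * width + col. A hypothetical pair of host-side helpers (the starter code's actual pixel representation may differ):

```cpp
#include <vector>
#include <cassert>

// Flatten a 2D image (vector of rows) into one contiguous buffer,
// using the row-major mapping flat[row * width + col] = img[row][col].
std::vector<unsigned char>
flatten(const std::vector<std::vector<unsigned char>>& img) {
    int height = (int)img.size();
    int width  = img.empty() ? 0 : (int)img[0].size();
    std::vector<unsigned char> flat(height * width);
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            flat[r * width + c] = img[r][c];
    return flat;
}

// Invert the mapping after the kernel finishes, for writing the output file.
std::vector<std::vector<unsigned char>>
unflatten(const std::vector<unsigned char>& flat, int height, int width) {
    std::vector<std::vector<unsigned char>> img(
        height, std::vector<unsigned char>(width));
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            img[r][c] = flat[r * width + c];
    return img;
}
```

Inside the kernel, the same row * width + col arithmetic recovers a pixel's neighbors from the flat buffer.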

To check the correctness of your CUDA kernel, you should also write a version of the blurring filter to run on the host CPU and then compare the results from that uniprocessor version to the results obtained by your CUDA kernel. There is already a checkResults() function in the starter code.
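As one possible shape for that host reference version, here is a sketch that assumes the naive blur from program 3 averages each pixel over a square (2*radius+1)-wide window, skipping neighbors that fall outside the image; this is shown for a single channel, and your actual blur and pixel format may differ:

```cpp
#include <vector>
#include <cassert>

// Hypothetical CPU reference blur: average each pixel over a square
// window of the given radius, counting only in-bounds neighbors.
std::vector<float> cpuBlur(const std::vector<float>& in,
                           int width, int height, int radius) {
    std::vector<float> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            int count = 0;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    int ny = y + dy, nx = x + dx;
                    if (ny >= 0 && ny < height && nx >= 0 && nx < width) {
                        sum += in[ny * width + nx]; // row-major flat index
                        ++count;
                    }
                }
            }
            out[y * width + x] = sum / count;
        }
    }
    return out;
}
```

Running this once on the host and passing both buffers to checkResults() gives you a pixel-by-pixel correctness check on the CUDA kernel's output.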

Experiment with at least four different thread organizations to see how different two-dimensional block sizes impact the performance of your code. (For this assignment, each of the two dimensions must have a value greater than 1.) Graph your results in your presentation slides and discuss your observations, along with insights into why performance differs across the different thread block sizes.

Presentation

In addition to submitting your code, you need to create a presentation for class that shows the different block sizes you have chosen to use and their execution times. You should use one of the larger pictures in the picture/ directory (i.e., fall.jpg, green.jpg, peacock.jpg, or gorge.jpg) to generate these numbers; you're looking for kernel runtimes in the minute(s) range. Remember not to include CPU-GPU memory transfer times. Your presentation should also provide your observations about the performance differences and explain why the results varied across configurations.

Evaluation

Your grade has two components: 75% will be based on the correctness of your implementations, and 25% will be based on your presentation.

Submitting Your Work

Please submit your code and presentation slides via GitLab.