GPU Benchmarking

(8/29/2007) This page describes some early investigations into the use of GPUs for real-time signal processing. Here I compare the performance of the GPU and CPU for doing FFTs, and make a rough estimate of the performance of this system for coherent dedispersion.

For each FFT length tested:

  1. 8M random complex floats are generated (64MB total size).
  2. The data is transferred to the GPU (if necessary).
  3. The data is split into 8M/fft_len chunks, all of which are FFT'd with a single FFTW/CUFFT "batch mode" call.
  4. The FFT results are transferred back from the GPU.
The real time taken by each step is measured. The process is iterated and the times are averaged. All FFTs are done in-place. FFTW plans were computed with FFTW_MEASURE, and were not multi-threaded.
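
For reference, the GPU side of this loop might look roughly like the following minimal sketch (the buffer size, the particular fft_len value, and the variable names are illustrative, not the actual benchmark code):

    /* Sketch of the GPU benchmark loop: time the host-to-device copy, the
     * batched in-place FFT, and the device-to-host copy separately. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <cuda_runtime.h>
    #include <cufft.h>

    static double now(void)                     /* wall-clock time in seconds */
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    int main(void)
    {
        const size_t nsamp = 8 * 1024 * 1024;   /* 8M complex floats = 64 MB */
        const int fft_len  = 1024;              /* one of the lengths tested */
        const int batch    = nsamp / fft_len;

        cufftComplex *h_buf, *d_buf;
        cudaMallocHost((void **)&h_buf, nsamp * sizeof(cufftComplex));  /* pinned */
        cudaMalloc((void **)&d_buf, nsamp * sizeof(cufftComplex));
        for (size_t i = 0; i < 2 * nsamp; i++)  /* random test data */
            ((float *)h_buf)[i] = (float)rand() / RAND_MAX;

        cufftHandle plan;
        cufftPlan1d(&plan, fft_len, CUFFT_C2C, batch);

        double t0 = now();
        cudaMemcpy(d_buf, h_buf, nsamp * sizeof(cufftComplex), cudaMemcpyHostToDevice);
        double t1 = now();
        cufftExecC2C(plan, d_buf, d_buf, CUFFT_FORWARD);  /* in-place, batched */
        cudaThreadSynchronize();                          /* wait for the batch to finish */
        double t2 = now();
        cudaMemcpy(h_buf, d_buf, nsamp * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
        double t3 = now();

        printf("copy in %.3f ms, fft %.3f ms, copy out %.3f ms\n",
               1e3 * (t1 - t0), 1e3 * (t2 - t1), 1e3 * (t3 - t2));

        cufftDestroy(plan);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }

In practice this is wrapped in a loop over FFT lengths and iterations, with the times averaged as described above.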


The first plot shows time taken per FFT. The CUDA results are shown with data transfer time included (CUDA(tot) line) or not (CUDA(fft) line). The GPU results here are seen to be dominated by the data transfer time:

The same data plotted using FFTW's performance metric in Gflops:
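
FFTW's benchmark convention scores a complex transform of length N as 5 N log2(N) floating-point operations; dividing by the measured time per FFT gives the flops numbers plotted here. A trivial helper (names illustrative):

    #include <math.h>

    /* FFTW benchmark convention: a complex FFT of length n counts as
     * 5 n log2(n) floating-point operations. */
    static double fft_gflops(int fft_len, double sec_per_fft)
    {
        return 5.0 * fft_len * log2((double)fft_len) / sec_per_fft / 1e9;
    }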

Finally, we can measure the data transfer rate to/from the GPU for each trial. Performance is improved by allocating the transfer buffer using cudaMallocHost rather than plain malloc. The theoretical maximum data rate through a PCIe x16 slot is 31.25 Gb/s.
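
The comparison itself is straightforward; a rough sketch of how the pageable (malloc) and pinned (cudaMallocHost) cases might be timed, with the 64 MB buffer size and names being illustrative:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <cuda_runtime.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    /* Time one host-to-device copy and return the rate in Gb/s. */
    static double copy_rate_gbps(void *h_buf, void *d_buf, size_t nbytes)
    {
        double t0 = now();
        cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);
        double t1 = now();
        return 8.0 * nbytes / (t1 - t0) / 1e9;
    }

    int main(void)
    {
        const size_t nbytes = 64 * 1024 * 1024;
        void *d_buf, *pinned, *pageable = malloc(nbytes);
        cudaMalloc(&d_buf, nbytes);
        cudaMallocHost(&pinned, nbytes);
        memset(pageable, 0, nbytes);            /* touch the pages before timing */
        memset(pinned, 0, nbytes);

        printf("pageable: %.2f Gb/s\n", copy_rate_gbps(pageable, d_buf, nbytes));
        printf("pinned:   %.2f Gb/s\n", copy_rate_gbps(pinned, d_buf, nbytes));

        cudaFreeHost(pinned);
        cudaFree(d_buf);
        free(pageable);
        return 0;
    }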

Coherent Dedispersion

We can use the numbers just measured to make a rough estimate of the total bandwidth that could be coherently dedispersed in real time by one of these systems. This estimate makes the following assumptions:

Given these assumptions, this plot shows the estimated total bandwidth that could be handled by one system, as a function of DM, for RFs of 430 MHz and 1.4 GHz:

Coherent Dedispersion, part II

(9/11/2007) This is an update on the possibility of performing coherent dedispersion on the GPU, after implementing and benchmarking a basic program to do so. The program performs the following steps, each of which is timed (a sketch of the core GPU steps is given after the list):
  1. Approx. 25 MB of raw (8-bit, 2 pol, complex) data is transferred from main memory to the GPU.
  2. The data are split into Nfft buffers, which are overlapped in time by one chirp length.
  3. The 8-bit data is converted to single precision floating point.
  4. Each buffer is FFT'd, multiplied by a chirp function, then inverse FFT'd.
  5. The data are squared to form power and polarization cross products.
  6. The results are then transferred back to main memory.
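
For reference, a minimal sketch of what steps 4 and 5 might look like on the GPU for a single overlapped buffer. This assumes the (conjugate) chirp function has already been generated and copied to the device as d_chirp, and that a CUFFT plan of the right length exists; the kernel names and detection output layout are illustrative, not necessarily what the program actually uses:

    #include <cufft.h>

    /* Step 4: multiply the spectrum by the chirp function (complex multiply). */
    __global__ void chirp_multiply(cufftComplex *data, const cufftComplex *chirp, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            cufftComplex d = data[i], c = chirp[i];
            data[i].x = d.x * c.x - d.y * c.y;
            data[i].y = d.x * c.y + d.y * c.x;
        }
    }

    /* Step 5: form total power and polarization cross products from the two pols. */
    __global__ void detect(const cufftComplex *p0, const cufftComplex *p1,
                           float4 *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            cufftComplex a = p0[i], b = p1[i];
            out[i].x = a.x * a.x + a.y * a.y;    /* |p0|^2 */
            out[i].y = b.x * b.x + b.y * b.y;    /* |p1|^2 */
            out[i].z = a.x * b.x + a.y * b.y;    /* Re(p0 * conj(p1)) */
            out[i].w = a.y * b.x - a.x * b.y;    /* Im(p0 * conj(p1)) */
        }
    }

    /* Step 4 for one buffer of one polarization: FFT, chirp multiply, inverse FFT.
     * (CUFFT's inverse transform is unnormalized, so a 1/fft_len scale has to be
     * folded in somewhere, e.g. into the chirp function.) */
    void dedisperse_buffer(cufftHandle plan, cufftComplex *d_pol,
                           const cufftComplex *d_chirp, int fft_len)
    {
        int threads = 256, blocks = (fft_len + threads - 1) / threads;
        cufftExecC2C(plan, d_pol, d_pol, CUFFT_FORWARD);
        chirp_multiply<<<blocks, threads>>>(d_pol, d_chirp, fft_len);
        cufftExecC2C(plan, d_pol, d_pol, CUFFT_INVERSE);
    }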
The program was run for chirp lengths ranging from 16 to 32k samples. For each chirp length, FFT lengths from 2*(chirp length) up to 4M samples were timed, and the best-performing FFT length was used to make the following plots. First, we can show the maximum real-time bandwidth as a function of chirp length. The plot also shows DM as a function of chirp length for several standard RF and channel BW combinations:
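
The DM curves presumably come from the usual dispersion smearing time across a single channel; in the narrow-band approximation that conversion from DM to chirp length (in complex samples at the channel bandwidth) is roughly:

    /* Approximate chirp length in samples for a channel of width chan_bw_mhz
     * centered at rf_mhz, using the standard dispersion smearing formula in the
     * narrow-band limit; illustrative, not necessarily the exact curve plotted. */
    static double chirp_samples(double dm, double rf_mhz, double chan_bw_mhz)
    {
        double t_chirp = 8.3e3 * dm * chan_bw_mhz / (rf_mhz * rf_mhz * rf_mhz); /* s */
        return t_chirp * chan_bw_mhz * 1e6;  /* complex samples at the channel rate */
    }

For example, chirp_samples(100.0, 430.0, 1.0) comes out to roughly 10^4 samples, i.e. a DM of 100 pc cm^-3 at 430 MHz with 1 MHz channels needs chirps of order 10k samples, while the same DM at 1.4 GHz needs only a few hundred.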

These results agree fairly well with the rough estimate of the previous section, especially in the high-DM regime where the FFT dominates the computation. More insight can be gained by looking at the timing breakdown for each of these trials. In all cases, the computation is dominated by three of the steps listed above: FFT (step 4, not counting the multiply), data transfer to/from GPU (steps 1 and 6), and overlapping the data (step 2):
The data transfer time is constant, as expected. However, the overlap takes a significant amount of time for short chirp lengths (many small FFTs), accounting for some of the discrepancy between these results and the rough estimate. This could potentially be improved by additional optimization of the overlap procedure, which is currently implemented using cudaMemcpy (see the sketch below).
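
For reference, the overlap step as described might look something like the following (one device-to-device cudaMemcpy per FFT buffer; names and layout are illustrative). With short chirps the FFTs are small, so the number of buffers, and hence the number of small memcpy calls, becomes large; replacing the loop with a single custom copy kernel, or fewer larger copies, might reduce this overhead:

    #include <cuda_runtime.h>
    #include <cufft.h>

    /* Step 2 sketch: split a contiguous input block into nfft overlapping buffers.
     * Each buffer advances by (fft_len - chirp_len) samples but copies fft_len,
     * so consecutive buffers overlap by one chirp length. */
    void overlap_buffers(cufftComplex *d_out, const cufftComplex *d_in,
                         int fft_len, int chirp_len, int nfft)
    {
        int step = fft_len - chirp_len;        /* valid samples per buffer */
        for (int i = 0; i < nfft; i++)
            cudaMemcpy(d_out + (size_t)i * fft_len,
                       d_in  + (size_t)i * step,
                       fft_len * sizeof(cufftComplex),
                       cudaMemcpyDeviceToDevice);
    }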

This test has addressed several of the assumptions of the previous section, with the exception of:

Folding and the GPU