- CPU: Intel Core 2 Quad, 2.4GHz
- GPU: NVIDIA GeForce 8800 GTX

For each FFT length tested:

- 8M random complex floats are generated (64MB total size).
- The data is transferred to the GPU (if necessary).
- The data is split into 8M/fft_len chunks, and each is FFT'd (using a single FFTW/CUFFT "batch mode" call).
- The FFT results are transferred back from the GPU.
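The batching arithmetic above can be sketched in NumPy (standing in for the FFTW/CUFFT batch call; the variable names here are illustrative, not from the actual benchmark code):

```python
import numpy as np

total_samples = 8 * 1024 * 1024          # 8M complex floats
rng = np.random.default_rng(0)
data = (rng.standard_normal(total_samples)
        + 1j * rng.standard_normal(total_samples)).astype(np.complex64)
# 8M samples * 8 bytes per complex64 = 64 MB total
assert data.nbytes == 64 * 1024 * 1024

fft_len = 32768
n_chunks = total_samples // fft_len      # 8M / fft_len chunks

# One "batch mode" call: FFT each row of the reshaped array at once.
spectra = np.fft.fft(data.reshape(n_chunks, fft_len), axis=1)
```

A real FFTW/CUFFT batch plan does the same thing in one library call (`fftwf_plan_many_dft` / `cufftPlan1d` with a batch count) rather than via an array reshape.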

The first plot shows time taken per FFT. The CUDA results are shown with data transfer time included (CUDA(tot) line) or not (CUDA(fft) line). The GPU results here are seen to be dominated by the data transfer time:

The same data plotted using FFTW's performance metric in Gflops:
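For reference, FFTW's standard metric counts a complex transform as 5 N log2(N) floating-point operations, so the plotted value is:

```python
import math

def fftw_gflops(n, t_seconds):
    """FFTW's conventional performance metric for one complex FFT of
    length n taking t_seconds: 5*N*log2(N) "flops" per transform."""
    return 5 * n * math.log2(n) / t_seconds / 1e9
```

This is a nominal operation count, not the actual number of operations any particular FFT algorithm performs, so it serves only as a consistent basis for comparison.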

Finally, we can measure the data transfer rate to/from the GPU for each trial. Performance is improved by allocating the transfer buffer using cudaMallocHost (pinned memory) rather than plain malloc. The theoretical maximum data rate through a PCIe x16 slot is 31.25 Gb/s.
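The effective rate for a trial is just the buffer size over the measured transfer time; a quick sketch of that calculation (the elapsed time here is a hypothetical example, not a measured value):

```python
# Effective PCIe transfer rate for one trial.
buffer_bytes = 64 * 1024 * 1024   # the 64 MB test buffer from above
elapsed_s = 0.025                 # hypothetical measured transfer time
rate_gbps = buffer_bytes * 8 / elapsed_s / 1e9

pcie_x16_max_gbps = 31.25         # theoretical PCIe x16 maximum
utilization = rate_gbps / pcie_x16_max_gbps
```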

- "ASP-like" system parameters: 4-MHz channels, 8-bit sampling.
- Data unpacking (byte-to-float conversion) will be done in the GPU (reduces data transfer time per block by a factor of 4).
- Detection and folding will be implemented in the GPU, so there will be only one significant data transfer and two FFTs per block.
- The time needed for unpacking, filter multiplication, detection and folding is constant as a function of DM (or FFT length), and is set so that it is 25% of the total time at 32k FFT lengths. This is based on ASP experience, but could be totally wrong here! We won't know until the whole algorithm is tried in the GPU (**Update: See below**).
- Overlap is 25% (or, FFT length is 4 times chirp length).
- The data acquisition (from network, hard drive, EDT card, etc) can be done concurrently with dedispersion and does not negatively affect the PCIe transfer rate. This seems likely at these data rates, but hasn't been tested yet.
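The data rates implied by these parameters are modest, which is why concurrent acquisition seems plausible. A back-of-the-envelope check (the variable names are illustrative):

```python
# "ASP-like" parameters: 4-MHz channels, 8-bit complex sampling, 2 pols.
bw_hz = 4e6                      # channel bandwidth -> complex sample rate
bytes_per_sample = 2 * 2 * 1     # real+imag, 2 polarizations, 1 byte each
rate_bytes = bw_hz * bytes_per_sample   # raw data rate per channel, B/s

# Unpacking on the GPU means 8-bit samples cross the PCIe bus instead of
# 32-bit floats: a factor-of-4 reduction in transfer size per block.
transfer_reduction = 4           # sizeof(float) / sizeof(int8)
```

At 16 MB/s per channel, the raw input stream is well under 1% of the measured PCIe bandwidth, supporting (but not proving) the concurrency assumption.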

1. Approx. 25 MB of raw (8-bit, 2 pol, complex) data is transferred from main memory to the GPU.
2. The data are split into N_fft buffers, which are overlapped in time by one chirp length.
3. The 8-bit data is converted to single precision floating point.
4. Each buffer is FFT'd, multiplied by a chirp function, then inverse FFT'd.
5. The data are squared to form power and polarization cross products.
6. The results are then transferred back to main memory.
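The core of this per-block pipeline (overlap, FFT, chirp multiply, inverse FFT, detect) can be sketched in NumPy for a single polarization. This is an overlap-save sketch under the stated assumptions, not the actual GPU implementation; the function and variable names are illustrative, and the choice to discard the leading chirp-length samples of each buffer is one possible convention:

```python
import numpy as np

def dedisperse_pol(x, fft_len, chirp_len, chirp):
    """x: complex baseband samples for one polarization (already unpacked).
    chirp: frequency-domain dedispersion filter of length fft_len."""
    step = fft_len - chirp_len              # valid samples per buffer
    n_buf = (len(x) - chirp_len) // step    # number of overlapped buffers
    # Overlapped buffers: consecutive starts differ by one step, so
    # adjacent buffers share chirp_len samples.
    bufs = np.stack([x[i*step : i*step + fft_len] for i in range(n_buf)])
    # FFT, multiply by the chirp, inverse FFT (batched along axis 0).
    y = np.fft.ifft(np.fft.fft(bufs, axis=1) * chirp, axis=1)
    # Discard the wrap-around-contaminated overlap region.
    y = y[:, chirp_len:]
    # Detect: square to power (cross products would combine two pols).
    return (y * y.conj()).real
```

With `chirp_len = fft_len // 4` this reproduces the 25% overlap assumed above.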

These results agree fairly well with the rough estimate of the previous section, especially in the high-DM regime where the FFT dominates the computation. More insight can be gained by looking at the timing breakdown for each of these trials. In all cases, the computation is dominated by three of the steps listed above: FFT (step 4, not counting the multiply), data transfer to/from GPU (steps 1 and 6), and overlapping the data (step 2):

The data transfer time is constant, as expected. However, the overlap takes a significant amount of time at short chirps (many small FFTs), accounting for some of the discrepancy between these results and the rough estimate. This could potentially be improved by additional optimization of the overlap procedure (currently implemented using cudaMemcpy).

This test has addressed several of the assumptions of the previous section, with the exception of:

- Data acquisition (no 10 GbE card to test this with yet).
- Folding is not performed (see discussion below).