Results (2D -> 3D)
Now we analyze the image-to-object results. From the parallel data it is clear that the filter time per slice is approximately constant, i.e. if 10 slices take 10 s then 100 slices take 100 s. This makes intuitive sense, since the time it takes to apply a filter to an image should not depend on the contents of the image. We tested this hypothesis by inputting an array of all zeros and an array of all ones, and indeed the times needed to apply the filter were approximately the same. We therefore make this assumption in the serial analysis; the main reason is that we did not want to wait 20+ hours for the code to finish running in some extreme cases. Instead, we measured the time it takes to get through one slice and multiplied it by the number of slices.
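This per-slice timing extrapolation can be sketched as follows. The 3x3 box filter and all names here are illustrative stand-ins for the project's actual filter chain, not the real implementation:

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// One pass of a hypothetical 3x3 box filter over a single slice; it
// stands in for whatever filter chain the project actually applies.
std::vector<float> apply_filter(const std::vector<float>& slice, int w, int h) {
    std::vector<float> out(slice.size(), 0.0f);
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += slice[(y + dy) * w + (x + dx)];
            out[y * w + x] = sum / 9.0f;
        }
    return out;
}

// Time one slice, then extrapolate: the filter cost is content-independent,
// so the total is just the per-slice time scaled by the slice count.
double estimate_total_seconds(int w, int h, int n_slices) {
    std::vector<float> slice(w * h, 1.0f);  // contents don't matter
    auto t0 = std::chrono::steady_clock::now();
    apply_filter(slice, w, h);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() * n_slices;
}
```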
Building a Complete Voxel Matrix
The chart shows the massive speedups we obtain by using CUDA over the serial algorithm. The reason for this speedup is clear: a thread is assigned to each pixel, so essentially all pixels can be processed simultaneously in the time it takes the serial algorithm to analyze one pixel. What is surprising, however, is that although the CUDA code is orders of magnitude faster than the serial code, both appear to scale linearly with job size (though CUDA exhibits a much smaller slope). This linear growth is strange for CUDA: since each thread is assigned to one pixel, one would expect that as long as there are as many threads as there are pixels, the job size shouldn't matter. The most likely explanation is the same overhead problem we faced in the MPI implementation of the object-to-image code: all threads simultaneously try to write into one array, so the compiler may insert unnecessary waits for access to elements of the array, and the computation time consequently scales linearly with the number of pixels allocated. The linear growth of the serial implementation makes sense, since doubling the number of pixels should double the time it takes to apply the filters to all of them.
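The thread-per-pixel mapping can be sketched serially: each pixel's output depends only on its own input, so the two loops below are exactly what CUDA replaces with one thread per pixel (x and y then come from blockIdx/threadIdx). The threshold filter and all names are illustrative, not the project's actual filter:

```cpp
#include <cassert>
#include <vector>

// Per-pixel "kernel body": in the CUDA version this runs once per thread;
// here a serial loop plays scheduler. Each thread writes only its own
// output element, so no synchronization is logically required.
inline void filter_pixel(const std::vector<float>& in, std::vector<float>& out,
                         int w, int x, int y) {
    out[y * w + x] = in[y * w + x] > 0.5f ? 1.0f : 0.0f;  // threshold is illustrative
}

std::vector<float> run_serial(const std::vector<float>& in, int w, int h) {
    std::vector<float> out(in.size());
    for (int y = 0; y < h; ++y)        // CUDA replaces these two loops
        for (int x = 0; x < w; ++x)    // with one thread per pixel
            filter_pixel(in, out, w, x, y);
    return out;
}
```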
Culling Hidden Voxels
The next part of the code removes extraneous voxels and follows essentially the same procedure as applying the filters, since for each image the code must examine all nearest neighbors, just as with the filters. By using CUDA we expect the same speedups that we saw when applying the filters. Indeed, as indicated in the chart above, we find massive speedups, yet they again scale linearly with job size, just as with our filter data; the analysis detailed in the previous paragraph applies to this segment of the code as well.
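As a sketch of the culling step, assuming a voxel is hidden exactly when all six face-neighbors are occupied (the grid layout and the 6-neighbor rule are our assumptions here, not taken from the project code):

```cpp
#include <cassert>
#include <vector>

// Occupancy grid of size n*n*n, flattened as (z*n + y)*n + x.
bool hidden(const std::vector<int>& vox, int n, int x, int y, int z) {
    auto at = [&](int i, int j, int k) -> int {
        if (i < 0 || j < 0 || k < 0 || i >= n || j >= n || k >= n) return 0;
        return vox[(k * n + j) * n + i];
    };
    // Hidden iff every face-neighbor is occupied.
    return at(x-1,y,z) && at(x+1,y,z) && at(x,y-1,z) && at(x,y+1,z)
        && at(x,y,z-1) && at(x,y,z+1);
}

std::vector<int> cull(const std::vector<int>& vox, int n) {
    std::vector<int> out(vox);
    for (int z = 0; z < n; ++z)          // in CUDA, one thread per voxel
        for (int y = 0; y < n; ++y)
            for (int x = 0; x < n; ++x)
                if (vox[(z*n + y)*n + x] && hidden(vox, n, x, y, z))
                    out[(z*n + y)*n + x] = 0;  // interior voxel: drop it
    return out;
}
```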
Slow Writes to a File
The final piece of the code, in which the CPU writes out all of model space into an .obj file, unfortunately cannot be parallelized, as mentioned in the project summary. This is by far the largest bottleneck in the code, since there is no difference between the serial and parallel implementations of this part. We include a brief list of the times it takes to write out the .obj file for a given number of slices and a given image size:
726 sec for 100 reconstructed slices of the chair
695 sec for 100 reconstructed slices of the Pikachu
111 sec for 100 reconstructed slices of the teapot
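A minimal sketch of such a writer, assuming one unit cube is emitted per surviving voxel (the Voxel struct and cube tessellation are illustrative). Because .obj vertex indices are global and 1-based, every cube's face lines depend on how many vertices were written before it, which is why the write is inherently sequential:

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <vector>

struct Voxel { int x, y, z; };  // illustrative: integer grid coordinates

// Emit one unit cube per occupied voxel in Wavefront .obj format.
void write_obj(std::ostream& out, const std::vector<Voxel>& voxels) {
    long base = 0;  // .obj vertex indices are 1-based and global
    for (const Voxel& v : voxels) {
        for (int i = 0; i < 8; ++i)  // 8 corners of the unit cube
            out << "v " << v.x + (i & 1) << ' '
                << v.y + ((i >> 1) & 1) << ' '
                << v.z + ((i >> 2) & 1) << '\n';
        // six quad faces, indices relative to this cube's first vertex
        static const int f[6][4] = {{1,2,4,3},{5,6,8,7},{1,2,6,5},
                                    {3,4,8,7},{1,3,7,5},{2,4,8,6}};
        for (auto& q : f)
            out << "f " << base + q[0] << ' ' << base + q[1] << ' '
                << base + q[2] << ' ' << base + q[3] << '\n';
        base += 8;
    }
}
```

Each cube contributes 8 vertex lines and 6 face lines, so the output grows linearly with the number of surviving voxels, consistent with the write times above tracking model complexity.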