Conclusion

Main takeaways from this project:
- Reading data from a file is so computationally inexpensive that parallelizing it does not pay off: the processors must communicate to gather the data at the end, and that overhead outweighs any gain, especially when the final data structure must be in a specific order.
- Although CUDA can speed up computation by orders of magnitude in some cases (e.g., applying filters), communication between the CPU and GPU can be expensive. We experienced this directly when assigning one face to each thread in the object-to-image code. Although we eliminated an expensive "for" loop at every iteration, we added communication between the GPU and CPU, and these two effects could offset one another.
- Parallel computation times can scale linearly with job size if every processor or thread writes to the same file or array. Many processors or threads try to access the same piece of memory, yet only one can do so at a given time. As the number of processors or threads increases, the total wait to access the file or array increases linearly, which destroys weak scaling. We saw this when we used MPI to find line segments intersecting the plane: each processor had to save its image to the hard disk at the same time (something the hard disk cannot do), so the computation time scaled linearly with the job size. We saw it again in our CUDA kernel: as the number of pixels increased, the computation time increased linearly because all threads were trying to access the same array.
- GPUs cannot write out to a file on the CPU (stdio.h is not even available in PyCUDA). Even if they could, the bottleneck in our image-to-object code could not be parallelized, because the order of writing matters. This is good to know in general.
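The serialized-access effect described in the takeaways above can be illustrated with a simple timing model. This is only a sketch under stated assumptions, not a measurement from our code: it assumes each worker's compute phase overlaps perfectly with the others, that the writes to the shared file or array happen strictly one at a time, and that every worker has the same amount of work (the weak-scaling setup). The function names and the timing values are hypothetical and chosen only for illustration.

```python
def weak_scaling_time(workers, compute_time, write_time):
    """Modeled wall time: parallel compute phase, then one write per worker in turn."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Assumption: compute overlaps fully across workers, while the writes
    # queue up behind one another on the shared file or array.
    return compute_time + workers * write_time


def weak_scaling_efficiency(workers, compute_time, write_time):
    """Single-worker time divided by P-worker time; 1.0 would be ideal weak scaling."""
    return (weak_scaling_time(1, compute_time, write_time)
            / weak_scaling_time(workers, compute_time, write_time))


if __name__ == "__main__":
    # Hypothetical timings in seconds, not taken from our runs.
    for p in (1, 2, 4, 8, 16):
        t = weak_scaling_time(p, compute_time=1.0, write_time=0.5)
        e = weak_scaling_efficiency(p, compute_time=1.0, write_time=0.5)
        print(f"P={p:2d}  time={t:5.1f} s  efficiency={e:.2f}")
```

Under this model the total time grows linearly with the number of workers once the serialized write term dominates, so efficiency falls toward zero instead of staying near 1.0. That is the qualitative pattern we observed, although the real runs will not follow the model exactly.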
In order to test how effective our algorithm was, we took a 3D object, decomposed it into 2D images, then rebuilt it using our "voxel-building" algorithm. Here are pictures of the actual object vs. our decomposed-reconstructed object.
In conclusion, we achieved order-of-magnitude speedups in both the object-to-image code and the image-to-object code! Input an object file and enjoy!