Results (3D -> 2D)
Reading in 3D Object File
As the graphs show, MPI gives no notable speedup over the serial algorithm when reading the .obj file and building the vertex list and face list. Moreover, as the file size increases (from 530 vertices and 992 faces in the teapot to 844 vertices and 1684 faces in pikachu), the MPI implementation actually becomes slower than the serial implementation, so the speedup (or lack thereof) worsens with file size. This suggests that reading the file is cheap enough that the overhead of parallelizing it outweighs any gain. For the file sizes we tested, it is simply faster to read the file in serial.
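To make the reading step concrete, the following is a minimal sketch of a serial reader that builds a vertex list and a face list. It is not our actual reader: the function name parse_obj is hypothetical, only "v" and "f" records are handled, and negative (relative) face indices are not supported.

```python
def parse_obj(text):
    """Parse the vertex and face records of a Wavefront .obj document.

    Hypothetical helper for illustration; only "v" and "f" lines are read.
    """
    vertices = []  # list of (x, y, z) floats
    faces = []     # list of tuples of 0-based vertex indices
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "v":
            vertices.append(tuple(float(c) for c in parts[1:4]))
        elif parts[0] == "f":
            # Entries may look like "3", "3/1" or "3//2"; keep the vertex index only.
            # .obj indices are 1-based, so shift them to 0-based.
            faces.append(tuple(int(p.split("/")[0]) - 1 for p in parts[1:]))
    return vertices, faces


if __name__ == "__main__":
    sample = "v 0 0 0\nv 1 0 0\nv 0 1 0\nf 1 2 3\n"
    verts, faces = parse_obj(sample)
    print(len(verts), "vertices,", len(faces), "faces")
```

The whole job is a single pass over the lines with a small amount of work per line, which is consistent with the observation that there is little to gain from distributing it.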
Finding Intersections and Writing to Image
Next we come to the "finding segments" portion of the code. Here we try to find the line segments where the object's faces intersect each slice plane. In our first approach, we assign the pth slice plane to the pth processor via MPI. The speedups were quite good (as demonstrated by the graphs), although the efficiency does not stay near 1 for long (it typically drops off at around 4 processors). The speedups typically level out at around 32 processors, implying that for a given problem size there is a specific number of processors that gives the maximum speedup; adding more processors beyond that point would actually slow down the computation. This is most likely due to many processors simultaneously trying to write large files to the disk. Such an access pattern (at least on our non-SSD hard drives) is very slow: a spinning hard drive cannot efficiently write multiple large files at the same time.
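The assignment of slice planes to processors can be sketched as follows. This assumes a round-robin scheme for the case where there are more slices than processors, which is an assumption for illustration rather than a description of our code; the function name slices_for_rank is hypothetical. In an MPI program, rank and size would come from the communicator (MPI_Comm_rank and MPI_Comm_size).

```python
def slices_for_rank(rank, size, n_slices):
    """Return the slice indices handled by one rank.

    Hypothetical round-robin scheme: rank r takes slices r, r + size, r + 2*size, ...
    """
    return list(range(rank, n_slices, size))


if __name__ == "__main__":
    size = 4  # number of processors (illustrative value)
    for rank in range(size):
        print("rank", rank, "->", slices_for_rank(rank, size, 10))
```

Each rank can compute its own slice list without any message passing, which is why this step is embarrassingly parallel up to the point where the results are written out.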
We suspected that the bottleneck in this procedure was the extra "for" loop at every slice plane, since each processor had to check every face for every slice plane. We therefore wrote a CUDA implementation that assigns each thread to one face of the object, thereby eliminating the "for" loop over the faces. Surprisingly, this implementation provided only moderate speedups over the serial algorithm. Moreover, the speedup seems to worsen with job size: for pikachu (image size 1824 by 1824) the speedup is only 1.09, while for the teapot (image size 384 by 792) the speedup was close to 1.3. The most likely explanation is the overhead of communication between CPU and GPU, since we must transfer the entire face list (once) and an image array (at every iteration).
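The per-face work can be sketched as below. This is an illustration under stated assumptions, not our CUDA kernel: faces are assumed to be triangles, slice planes are assumed to be of the form z = z0, and the names face_slice_segment and slice_segments are hypothetical. The first function is what a single thread would do for its face; the loop in the second function is the loop over faces that the one-thread-per-face design removes.

```python
def face_slice_segment(tri, z0):
    """Return the 2D segment where the plane z = z0 cuts a triangle, or None.

    tri is a sequence of three (x, y, z) points. A vertex counts as "below"
    the plane when z < z0, so a vertex lying exactly on the plane never
    causes a division by zero.
    """
    points = []
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        if (a[2] < z0) != (b[2] < z0):  # this edge crosses the plane
            t = (z0 - a[2]) / (b[2] - a[2])
            points.append((a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1])))
    # A triangle is crossed along either zero or two of its edges.
    return tuple(points) if len(points) == 2 else None


def slice_segments(vertices, faces, z0):
    """Serial version: check every face against one slice plane."""
    segments = []
    for face in faces:  # the loop that one-thread-per-face removes
        seg = face_slice_segment([vertices[i] for i in face], z0)
        if seg is not None:
            segments.append(seg)
    return segments
```

The arithmetic per face is only a few multiplications and additions, so the data transferred per face is large relative to the work done on it, which fits the explanation that transfer overhead dominates.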
We tested for weak scaling in the "finding segments" portion of the code by keeping the job size per processor constant while varying the number of processors and the number of slices together. The graph shows that with 25 jobs per processor, the computation time increased linearly with the number of processors. This finding is strange since the jobs are embarrassingly parallel: the processors exchange no messages, and their only interaction is that each writes to the same file. One explanation is that as the number of processors increases, there is increasing contention to write to that file, and consequently each processor waits longer for its turn. This would explain why the computation time scales linearly with the number of processors despite a constant ratio of jobs per processor.
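The contention explanation can be expressed as a toy model. This is a sketch of the hypothesis only, not a fit to our measurements: it assumes that the compute time per processor is fixed and that writes to the shared file happen one processor at a time. The function names and the timing values in the example are hypothetical.

```python
def weak_scaling_time(p, t_compute, t_write):
    """Modeled wall time on p processors with fixed work per processor.

    Assumes each processor computes for t_compute seconds in parallel and
    that the p writes of t_write seconds each are fully serialized.
    """
    return t_compute + p * t_write


def weak_scaling_efficiency(p, t_compute, t_write):
    """Weak-scaling efficiency T(1) / T(p); 1.0 would be ideal."""
    return weak_scaling_time(1, t_compute, t_write) / weak_scaling_time(p, t_compute, t_write)


if __name__ == "__main__":
    # Illustrative values only, not measured timings.
    for p in (1, 2, 4, 8, 16, 32):
        print(p, weak_scaling_time(p, 2.0, 0.5), round(weak_scaling_efficiency(p, 2.0, 0.5), 3))
```

Under these assumptions the wall time grows linearly in the number of processors even though the work per processor is constant, which is the qualitative behavior we observed.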
Thus, it appears that our MPI code exhibits neither weak scaling nor strong scaling.
However, at the end of the day we still saw speedups of roughly an order of magnitude. Our fastest parallel code (MPI with 16 processors) had a computation time of 1.55s, as opposed to 11.66s for its serial counterpart, on 100 slices of the teapot (about 7.5x). For 100 slices of pikachu we obtained 6.0s with 32 processors in MPI versus 74.12s in serial (about 12.4x). Finally, for the chair we had 5.7s with 32 processors in MPI versus 77.9s in serial (about 13.7x). Thus we obtain roughly an order-of-magnitude increase in speed via our parallelization!
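For reference, speedup and parallel efficiency can be computed from the timings quoted above as in the sketch below. The timings and processor counts are the ones reported in this section; the function and variable names are ours for illustration, and efficiency is taken to be speedup divided by the number of processors.

```python
# (serial seconds, parallel seconds, number of MPI processors), as reported above
REPORTED = {
    "teapot": (11.66, 1.55, 16),
    "pikachu": (74.12, 6.0, 32),
    "chair": (77.9, 5.7, 32),
}


def speedup(t_serial, t_parallel):
    """Speedup of the parallel run relative to the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, p):
    """Parallel efficiency: speedup per processor (1.0 would be ideal)."""
    return speedup(t_serial, t_parallel) / p


if __name__ == "__main__":
    for name, (ts, tp, p) in REPORTED.items():
        print(name, round(speedup(ts, tp), 2), round(efficiency(ts, tp, p), 2))
```

The efficiencies come out at roughly 0.4 to 0.5 for all three models, which is consistent with the efficiency dropping well below 1 at higher processor counts.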