Particles README
---------------------------------

DISCLAIMER and CREDIT:
This code is offered without warranty, as is.

The base particle code originally came from the class site for UC Berkeley's
CS 267. It has been extensively modified since, including the creation of
the GPU codes. (UC Berkeley has since independently developed its own CUDA
version.) The modifications to the base code were made by
Dan Ernst and Brandon Holt while both were at UW-Eau Claire.

OVERVIEW
---------------------------------
This particle simulation is an example of a simple N-body problem.
It simulates a number of particles that all interact through a
simple short-range repulsive force that falls off with the square of
the distance (similar to a repulsive electrostatic force).

Executable usage:
(./particles_cpu or ./particles_cuda)
options:
-n #          (number of particles)
-o filename   (name of the file to write position data to at every time step)
example call: ./particles_cuda -n 10000 -o out.txt

These output files can be visualized with the included visualizer programs.
While you do not want the -o option on for performance testing, it is an
excellent way to examine your code for basic correctness issues.

(Remember: getting the wrong answer quickly isn't helpful!)
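
Besides visual inspection, one numerical check is to compare the
positions produced by the CPU and GPU versions for the same time step.
The sketch below is only an illustration in plain C++: the Position
struct, the 2D layout, and the function name are assumptions, not part
of the lab code, and it presumes both runs start from identical initial
conditions. Reading the -o file into these vectors is left out because
the file format is not described here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 2D position record; the real output layout may differ.
struct Position {
    double x;
    double y;
};

// Largest per-coordinate absolute difference between two snapshots of
// the same time step (for example one from each executable).
double max_position_error(const std::vector<Position>& a,
                          const std::vector<Position>& b) {
    assert(a.size() == b.size());  // both runs must use the same -n
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::fmax(worst, std::fabs(a[i].x - b[i].x));
        worst = std::fmax(worst, std::fabs(a[i].y - b[i].y));
    }
    return worst;
}
```

Small differences are expected because floating-point results depend on
the order of operations, and they tend to grow over many time steps, so
a check like this is most meaningful on the first few steps.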

Both executables print to stdout the time taken to calculate
all the forces in each time step.

Each directory includes a Makefile that recompiles the version in that
directory (serial or CUDA) every time it is called.
You can build both executables by calling "make", or each one
individually by calling "make cuda" or "make cpu".

LAB
---------------------------------

Because each particle feels a force from every other particle, each
time step costs O(n^2) work. Luckily for us, though, the problem is
highly parallelizable, as each body's acceleration can be calculated
independently of the others for each time step.
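
The structure described above can be sketched as a serial all-pairs
loop. This is a minimal illustration in plain C++, not the lab's actual
force routine: the Particle struct, the cutoff parameter, the softening
constant, and the use of unit mass and unit force constant are all
assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 2D particle record; field names are illustrative only.
struct Particle {
    float x, y;    // position
    float ax, ay;  // acceleration for the current time step
};

// Assumed guard so the division stays finite if two particles coincide.
constexpr float kSoftening = 1e-6f;

// All-pairs pass: n outer iterations, each scanning all n particles.
// Pairs farther apart than `cutoff` (hypothetical parameter) are skipped.
void compute_accelerations(std::vector<Particle>& p, float cutoff) {
    assert(cutoff > 0.0f);
    const float cutoff2 = cutoff * cutoff;
    const std::size_t n = p.size();
    for (std::size_t i = 0; i < n; ++i) {
        float ax = 0.0f;
        float ay = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            if (j == i) continue;
            const float dx = p[i].x - p[j].x;  // points away from j (repulsion)
            const float dy = p[i].y - p[j].y;
            const float r2 = dx * dx + dy * dy;
            if (r2 > cutoff2) continue;        // short-range: ignore distant pairs
            // Magnitude 1/r^2 along the unit vector (dx, dy)/r is (dx, dy)/r^3.
            const float s2 = r2 + kSoftening;
            const float inv_r3 = 1.0f / (s2 * std::sqrt(s2));
            ax += dx * inv_r3;
            ay += dy * inv_r3;
        }
        // Iteration i writes only p[i]; every other particle is read-only.
        p[i].ax = ax;
        p[i].ay = ay;
    }
}
```

Because iteration i of the outer loop never writes data that another
iteration reads, the outer loop is the part that can be spread across
GPU threads, one particle per thread.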

For this exercise we will focus on just calculating the forces on the
GPU and will leave the rest of the calculations up to the CPU. A CUDA
kernel has already been written for you which calculates all the forces
for all the particles, but it has been implemented with no regard for
efficiency. Your task will be to rewrite the kernel using what you've
learned about the GPU architecture and CUDA optimization techniques.

Here are some places you might want to start:
- Take a look at where in memory things are, and consider ways that
  you can take advantage of faster kinds of memory. Specifically, you
  could look into using shared memory. Each thread calculates the forces
  on its own particle due to every other particle, so it always needs to
  know its own particle's position, but the other particles' positions
  could be loaded into shared memory in chunks (or tiles, if you want to
  think about it that way).
- Try tweaking the number of threads and the number of blocks. Currently
  the number of threads per block is set to 256, but it might be faster to
  use more or fewer.
- You may be able to reduce the number of floating-point operations needed
  to calculate the forces. Try compiling with the "-use_fast_math" option
  to use the faster, reduced-precision CUDA math functions, and look into
  "rsqrtf()" in particular.
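
To make the suggestions above more concrete, here is a CPU-side sketch
of the loop structure they point toward. It is plain C++, not CUDA, and
it is not a finished kernel: Vec2, kTile, kSoftening, tiles_for, and
tiled_accelerations are hypothetical names, and the force uses unit
constants with no cutoff. On the GPU, the local tile array would
presumably become a __shared__ array filled cooperatively by the threads
of a block (with __syncthreads() around the load), the loop over i would
be replaced by one thread per particle, and 1.0f / std::sqrt(r2) would
become rsqrtf(r2).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 {
    float x, y;
};

constexpr std::size_t kTile = 256;   // assumed tile size (the current block size)
constexpr float kSoftening = 1e-6f;  // assumed guard against division by zero

// Number of tiles (or thread blocks) needed to cover n items: ceil(n / tile).
std::size_t tiles_for(std::size_t n, std::size_t tile) {
    assert(tile > 0);
    return (n + tile - 1) / tile;
}

std::vector<Vec2> tiled_accelerations(const std::vector<Vec2>& pos) {
    const std::size_t n = pos.size();
    std::vector<Vec2> acc(n, Vec2{0.0f, 0.0f});
    Vec2 tile[kTile];  // stands in for a __shared__ array on the GPU

    const std::size_t num_tiles = tiles_for(n, kTile);
    for (std::size_t t = 0; t < num_tiles; ++t) {
        const std::size_t base = t * kTile;
        const std::size_t count = std::min(kTile, n - base);  // last tile may be partial
        // Stage one chunk of positions in the fast buffer.
        std::copy(pos.begin() + base, pos.begin() + base + count, tile);

        for (std::size_t i = 0; i < n; ++i) {  // one GPU thread per i
            const Vec2 self = pos[i];          // own position, read once per tile
            for (std::size_t k = 0; k < count; ++k) {
                if (base + k == i) continue;   // skip self-interaction
                const float dx = self.x - tile[k].x;
                const float dy = self.y - tile[k].y;
                const float r2 = dx * dx + dy * dy + kSoftening;
                // One reciprocal square root replaces a sqrt plus a division.
                const float inv_r = 1.0f / std::sqrt(r2);
                const float inv_r3 = inv_r * inv_r * inv_r;
                acc[i].x += dx * inv_r3;
                acc[i].y += dy * inv_r3;
            }
        }
    }
    return acc;
}
```

The same rounding-up division in tiles_for is what a launch needs when
the thread count per block changes: for example, 10000 particles at 256
threads per block need 40 blocks, and the threads past the end of the
array in the last block must do no work.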