
The Vega64 GPU that AMD just released has 4096 "threads" operating at one time (actually in flight), with up to 40,960 of them "at the ready" at the hardware level (kinda like Hyperthreading, except GPUs keep up to ~10 wavefronts per "shader core" resident for quick swap-ins and swap-outs). Subject to register and memory requirements, of course: a program that uses a ton of vGPR registers on the AMD system may "only" run 4096 threads at a time, and maybe only keep ~5 wavefronts per core resident (20,480 threads total), short of the ~10 needed for maximum occupancy.
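The register-pressure trade-off above can be sketched with a toy occupancy calculator. The figures assumed here (256 vGPRs per lane in the register file, a hardware cap of 10 wavefronts per SIMD) come from AMD's GCN documentation; treat the function as illustrative, not as AMD's actual allocator.

```python
# Toy GCN-style occupancy model (assumed figures, for illustration):
# each SIMD's register file provides 256 vGPRs per lane, and hardware
# caps residency at 10 wavefronts per SIMD.
VGPRS_PER_LANE = 256
MAX_WAVES_PER_SIMD = 10

def waves_per_simd(vgprs_per_thread):
    """How many wavefronts stay resident, given per-thread vGPR demand."""
    if vgprs_per_thread == 0:
        return MAX_WAVES_PER_SIMD
    # Registers are the shared resource: more vGPRs per thread means
    # fewer wavefronts fit, down from the hardware cap of 10.
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // vgprs_per_thread)

print(waves_per_simd(24))   # light register use: capped at 10
print(waves_per_simd(48))   # heavy register use: only 5 fit
```

So a kernel needing 48 vGPRs per thread drops to 5 resident wavefronts per SIMD, which is the "only 5 per core" situation described above.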

It's a weird architecture because each group of "threads" shares an instruction pointer (NVidia has 32 threads per warp, AMD has 64 work-items per wavefront), so it's not "really" the same kind of "thread" as in Linux pthreads. But still, the scope of parallelism on a $500 GPU today is rather outstanding.
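The shared-instruction-pointer point can be sketched with a toy SIMT model: all 64 lanes of a wavefront step through the same instruction stream, and a branch doesn't redirect individual lanes, it just masks them off while both sides are issued. The 64-lane width follows AMD's wavefront size; the tiny program here is invented for illustration.

```python
# Toy SIMT execution: one shared "instruction pointer" per wavefront.
WAVEFRONT_SIZE = 64

def run_wavefront(values):
    """Each lane holds one value; halve evens, triple odds."""
    assert len(values) == WAVEFRONT_SIZE
    # The branch condition becomes a per-lane execution mask, not a jump.
    even_mask = [v % 2 == 0 for v in values]
    # then-side: issued to every lane, only masked-on lanes commit
    values = [v // 2 if m else v for v, m in zip(values, even_mask)]
    # else-side: the same lanes run these instructions too, mask inverted
    values = [v * 3 if not m else v for v, m in zip(values, even_mask)]
    return values

result = run_wavefront(list(range(64)))
print(result[4], result[5])  # lane 4 (even): 4//2=2, lane 5 (odd): 5*3=15
```

This is why divergent branches hurt: every lane pays for both sides of the `if`, which is nothing like how a pthread takes a branch.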

All of these threads could potentially hit the same global memory location at the same time. I mean, if you want bad performance, of course, but it's entirely possible, since the global memory space is shared between all compute units on the GPU.
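A toy serialization model shows why hammering one address is the bad-performance case: accesses to the same location must be serialized, while accesses to distinct locations can proceed together. This cost model is invented for illustration, not measured hardware behavior.

```python
from collections import Counter

def serialized_steps(addresses):
    """Worst-case passes needed: the most-contended address dominates."""
    return max(Counter(addresses).values())

print(serialized_steps([0] * 64))        # all 64 lanes hit address 0
print(serialized_steps(list(range(64)))) # 64 lanes, 64 distinct addresses
```

All 64 lanes touching one address costs 64 serialized passes; spread across 64 distinct addresses, it's a single pass.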
