This post is inspired by *I want a good parallel computer*.
It is important to understand why the PS3 failed. My perspective here is that of a junior-ish dev working on simulation and rendering in triple-A. I remember less than I knew, and knew less than most!
However, what I can provide here is the hindsight of someone who actually developed and shipped titles 1 on the PS3. I wanted the PS3 to succeed. To be more specific: I wanted Many-Core to succeed.
The PS3 failed developers because it was an excessively heterogeneous computer, and low-level heterogeneous compute resists composability.2
The primary characteristic of Many-Core is, by virtue of the name, the high core count. Many-Core is simply a tradeoff: it enables wide parallelism through more explicit (programmer) control.
| CPU | GPU | Many-Core |
|---|---|---|
| Few complex cores | Wide SIMD | Many simpler cores |
| Low latency, out-of-order, superscalar | High latency, vector instruction pipes | RISC-like, scalar |
| Caches and a coherency protocol | Fences, flushes, incoherence | Message passing, local storage, DMA |
| Explicit coarse synchronization | Implicit scheduling | Explicit fine synchronization |
The claim from Wikipedia is that a single SPE does 25 GFLOPS and the PS3 GPU does 192 GFLOPS. Even if you absolutely maxed out your SPE usage, you would still not be close to the power of the underpowered PS3 GPU. For contrast, the Xbox 360 GPU had 240 GFLOPS. The PS3 GPU also had separate vertex and pixel shading units. In contrast, the Xbox 360 had shared (unified) shader resources, so it could load balance between heavy vertex shading and heavy pixel shading. (Examples here would be character skinning vs UI rendering.)
As a game developer, I can say these FLOPS numbers reflect the experience of developing on the respective platforms. This was particularly noticeable in something like post-processing, where the demands on the vertex unit are very low (a few large quads).
Due to the weakness of the GPU vertex unit, developers would use the SPEs to do skinning. The pixel shading unit did not have constant registers, so one also had to do shader patching on the SPEs before those programs could be sent to the GPU. All of these things require synchronization between the CPU, SPEs, and GPU, and they interact with workload balancing. In retrospect I also assume that dynamic branching in the pixel shader was either impossible or prohibitively expensive, which is why everyone generated excessive shader permutations. This means tens of megabytes of shaders. Again, contrast this with the Xbox 360, which supported wave operations 3; I even used this feature back in the day. Because each component of the PS3 is weak on its own, they all must be employed in concert to compete with less heterogeneous platforms.
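To make "shader patching" concrete, here is a minimal sketch of what it amounted to. All names and the offset-table layout are hypothetical, not the actual SDK interface: the idea is simply that per-draw constant values get written straight into a copy of the fragment-program microcode at offsets recorded when the shader was built, and the patched copy is what the GPU consumes.

```cpp
// Hypothetical sketch of SPE-side shader constant patching: the fragment
// program has no constant registers, so per-draw values are written directly
// into a copy of its microcode at offsets recorded at shader build time.
#include <cstdint>
#include <cstring>
#include <vector>

struct ConstantPatch {
    uint32_t byteOffset;   // where in the microcode this constant is embedded
    float    value[4];     // the vec4 value to splat into the instruction stream
};

// Patch a local copy of the microcode; the caller then moves the patched copy
// to GPU-visible memory and points the draw call at it.
void patchFragmentProgram(std::vector<uint8_t>& microcode,
                          const std::vector<ConstantPatch>& patches)
{
    for (const ConstantPatch& p : patches) {
        std::memcpy(microcode.data() + p.byteOffset, p.value, sizeof(p.value));
    }
}
```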
While the Cell could behave more like a supercomputer, I mostly saw it used like generic GPU compute. I never saw production code that did anything but dispatch N jobs from the PPE. I never saw direct inter-SPE communication, even though I recall such a thing was possible (mailboxes). This is similar to how inter-workgroup GPU workloads are rarer and more difficult.
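For flavor, the "dispatch N jobs from the PPE" pattern has roughly the fork/join shape sketched below. I use std::thread as a stand-in for SPE contexts, since the real thing would go through libspe2 or a job manager such as SPURS; the point is only the shape: independent jobs fanned out, no job-to-job communication.

```cpp
// Illustrative fork/join dispatch in the shape most PS3 engines used:
// the PPE splits a workload into N independent jobs and fans them out.
// std::thread stands in for SPE contexts here.
#include <cstddef>
#include <thread>
#include <vector>

void runJob(float* data, std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i)
        data[i] *= 2.0f;   // stand-in for real per-job work (skinning, culling, ...)
}

void dispatchJobs(float* data, std::size_t count, std::size_t numWorkers)
{
    std::vector<std::thread> workers;
    const std::size_t chunk = (count + numWorkers - 1) / numWorkers;
    for (std::size_t w = 0; w < numWorkers; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end   = (begin + chunk < count) ? begin + chunk : count;
        if (begin >= end) break;
        workers.emplace_back(runJob, data, begin, end);
    }
    for (std::thread& t : workers) t.join();   // PPE waits on all jobs; no inter-job messaging
}
```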
The heterogeneous nature was everywhere. Even the PPE was quite different from an SPE: the SPEs had only vector registers, while the PPE had floating-point, general-purpose, and vector registers. Is this really bad? No4, but it makes everything more heterogeneous and therefore more complex. Getting maximum performance out of the SPEs meant you were likely doing async DMAs while also doing compute work. These nuances could be a fun challenge for a top programmer but end up being more of an obstacle to development for game studios.
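What "async DMAs while also doing compute" means in practice is the double-buffering idiom sketched below. This is only the shape of the pattern, with memcpy and std::async standing in for the actual DMA intrinsics and tag waits, and with the input length assumed to be a multiple of the chunk size for brevity.

```cpp
// The classic SPE double-buffering idiom in skeleton form: while one
// "local store" buffer is being processed, the next chunk is already in
// flight. std::async + memcpy stand in for DMA kicks and tag waits.
#include <cstddef>
#include <cstring>
#include <future>
#include <utility>

constexpr std::size_t kChunkFloats = 4096;             // ~16 KB per buffer (illustrative)

static void process(float* chunk, std::size_t n)       // stand-in compute kernel
{
    for (std::size_t i = 0; i < n; ++i) chunk[i] *= 2.0f;
}

void streamAndProcess(const float* mainMemory, std::size_t totalFloats)
{
    float bufA[kChunkFloats], bufB[kChunkFloats];       // two fixed "local store" buffers
    float* curr = bufA;
    float* next = bufB;

    // Prime the pipeline: fetch the first chunk synchronously ("blocking DMA").
    std::memcpy(curr, mainMemory, kChunkFloats * sizeof(float));

    for (std::size_t offset = 0; offset < totalFloats; offset += kChunkFloats) {
        std::future<void> dma;
        const std::size_t nextOffset = offset + kChunkFloats;
        if (nextOffset < totalFloats) {
            // Kick the transfer for the *next* chunk before touching this one.
            dma = std::async(std::launch::async, [=] {
                std::memcpy(next, mainMemory + nextOffset, kChunkFloats * sizeof(float));
            });
        }
        process(curr, kChunkFloats);        // compute overlaps the in-flight copy
        if (dma.valid()) dma.wait();        // equivalent of waiting on the DMA tag
        std::swap(curr, next);
    }
}
```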
The PS3 had 512 MB of total memory, but 256 MB was dedicated to graphics and had only REDACTED MB/s access from the CPU. This means that, in addition to the 256 MB purely for graphics, you also had to dedicate system memory to anything that was written to and read back from the GPU. The point here is inflexibility and the heterogeneous nature.
The PS3 had cool features, but they were never general purpose and could only be exploited with careful attention to detail and sometimes significant engine changes. I recall using the depth bounds test for screen-space shadowing, and one could probably use it for a few other similar GPU techniques (lighting). There was also the alluring double-speed Z-only writes, a one-off win for depth maps if you don't actually use a pixel shader. I don't recall all the edges, but they were sharp, and straying off them meant performance cliffs. The next section covers the sharpest edge of them all.
Of course, the challenge everyone knows about with the SPEs is that memory access is constrained to local store. You got 256 KB, but in reality, once you factored in stack and program code, you were probably down to 128 KB. This computational model is far more restrictive than even modern GPU compute, where at least you can access storage buffers directly.
Most code and algorithms cannot be trivially ported to the SPE. C++ virtual functions and methods will not work out of the box. C++ encourages dynamic allocation of objects, but those objects can point anywhere in main memory. You would need to map pointer addresses from the PPE to the SPE to even attempt running a normal C++ program on the SPE. Also, null (address 0x0) points to the start of local memory, and loading from it is not a segfault.
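A minimal sketch of the usual workaround, assuming nothing about any particular engine: ship the SPE a flat block whose internal references are byte offsets rather than pointers, so the data still resolves correctly after it has been DMA'd to a different address in local store. The struct and field names below are purely illustrative.

```cpp
// Why PPE pointers are useless inside an SPE: the block gets copied into
// local store at a different address, so any embedded pointers dangle.
// The common fix was flat blocks whose internal references are offsets
// relative to the start of the block.
#include <cstdint>

struct SkinningJobHeader {
    uint32_t vertexCount;
    uint32_t positionsOffset;   // byte offset from the start of the block, not a pointer
    uint32_t weightsOffset;     // ditto
};

// On the "SPE" side, `block` is wherever the DMA landed in local store;
// the offsets still resolve because they are relative.
inline const float* positions(const uint8_t* block)
{
    const SkinningJobHeader* hdr = reinterpret_cast<const SkinningJobHeader*>(block);
    return reinterpret_cast<const float*>(block + hdr->positionsOffset);
}
```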
So, instead of running generic code on the SPE, what developers did was write handcrafted SPE-friendly code for heavy but parallelizable parts of their engine. With enough talent and investment you could eke out the full compute power of the PS3.5 Of course, this is maybe easier as a first-party developer, as you can at least focus on this exotic hardware and craft your engine and game features around the type of compute available. This is why Naughty Dog famously came so close to showing us the full potential of the console.
Had the PS3 been what was originally designed, it would have been a much more exotic but much less heterogeneous machine. The original design was approximately 4 Cell processors at high frequencies. Perhaps massaging this design would have led to a very homogeneous, high-performance Many-Core architecture. At more than 1 TFLOP of general-purpose compute it would have been a beast, and not a gnarly beast but a sleek, smooth, uniform tiger.
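A quick back-of-the-envelope check on that figure, under assumptions I should state explicitly: 8 SPEs per Cell, a 4-wide single-precision FMA per cycle (8 flops), and clocks around the 4 GHz that early Cell work aimed for rather than the 3.2 GHz that shipped.

```cpp
// Back-of-the-envelope check on the ">1 TFLOP" figure, under the stated
// assumptions (4 Cells, 8 SPEs each, 8 flops/cycle, ~4 GHz clocks).
constexpr double cells         = 4;
constexpr double spesPerCell   = 8;
constexpr double flopsPerCycle = 8;       // 4-wide single-precision FMA
constexpr double clockHz       = 4.0e9;
constexpr double totalFlops    = cells * spesPerCell * flopsPerCycle * clockHz;
// totalFlops == 1.024e12, i.e. just over 1 TFLOP, excluding the PPEs.
```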
One has to actually look at the PS3 as the licked cookie 6 of Many-Core designs. This half-baked, half-hearted attempt became synonymous with the failure of Many-Core. I used to think the PS3 set back Many-Core for decades; now I wonder if it simply killed it forever.
1 Some of the titles I have worked on.
2 I will not provide a proof or reference, but the mathematical issue is that the space is not covered uniformly. This low-level composability problem is often seen in instruction selection when writing assembly.
3 The Xbox 360 GPU had wave operations like ifany/ifall, similar to modern subgroup control-flow operations.
4 The fact that the SPEs had only vector registers was apparent to the programmer because loads/stores also had to be vector aligned.
5 PS3 SPE usage: it is clear that some games had higher utilization than others.
6 I am not sure my usage fits Raymond's narrow specifications.
classic.copetti.org: source for images and some technical specifications.