The PS3 Licked the Many Cookie

This post is inspired by "I want a good parallel computer".

It is important to understand why the PS3 failed. The perspective here is that of a junior-ish dev who worked on simulation and rendering in triple-A. I remember less than I knew, and knew less than most!

OG PS3

However, what I can provide here is the hindsight of someone who actually developed and shipped titles 1 on the PS3. I wanted the PS3 to succeed. To be more specific: I wanted Many-Core to succeed.

The PS3 failed developers because it was an excessively heterogeneous computer, and low-level heterogeneous compute resists composability.2

More like Multicore than Many

The primary characteristic of Many-Core is, by virtue of the name, the high core count. Many-Core is simply a tradeoff that enables wide parallelism through more explicit (programmer) control.

| CPU | GPU | Many-Core |
| --- | --- | --- |
| Few complex cores | Wide SIMD | Many simpler cores |
| Low latency, OoO, superscalar | Vector instruction pipes, high latency | RISC-like, scalar |
| Caches and coherency protocol | Fences, flushes, incoherence | Message passing, local storage, DMA |
| Explicit coarse synchronization | Implicit scheduling | Explicit fine synchronization |


At first glance, the SPEs of the PS3 fit the bill. They seem to have all the characteristics of Many-Core. The problem is that the most important characteristic, that there are many cores, is significantly lacking.

PS3 heterogeneous diagram

First off, you didn't get the full 8 SPEs as a (game) developer. Out of the 8 SPEs, one was disabled due to die yield and the OS got a core and a half. While this changed with updates, one only really got 5-6 SPEs to work with. The Xbox 360, in contrast, had what amounted to 3 PPEs (2 more). So the Cell really featured at most 3 more (difficult to use) cores than the Xbox 360.

Computationally Weak Components

The claim from the wiki is that a single SPE has 25 GFlops and the PS3 GPU has 192 GFlops. If you absolutely maxed out your SPE usage you would still not be close to the power of the underpowered PS3 GPU. For contrast, the Xbox 360 GPU had 240 GFlops. The PS3's GPU had separate vertex and pixel shading units. In contrast, the Xbox 360 had shared (unified) computational resources, so it could load balance between heavy vertex shading and heavy pixel shading. (Examples here would be character skinning vs. UI rendering.)

PS3 heterogeneous diagram

As a game developer, I found these GFlops numbers reflected the experience of developing on the respective platforms. This was particularly noticeable in something like post-processing, where the demands on the vertex unit are very low (large quads).
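To make the gap concrete, here is that arithmetic as a tiny C++ sketch. The 25/192/240 GFlops figures and the 5-6 usable SPEs are the numbers quoted above; nothing else is assumed.

```cpp
// Back-of-the-envelope peak-FLOP comparison using the numbers quoted above.
#include <cstdio>

int main() {
    const double gflops_per_spe  = 25.0;   // one SPE
    const double ps3_gpu_gflops  = 192.0;  // PS3 GPU
    const double x360_gpu_gflops = 240.0;  // Xbox 360 GPU

    for (int spes = 5; spes <= 6; ++spes) {
        const double total = spes * gflops_per_spe;
        std::printf("%d SPEs: %.0f GFlops (%.1f%% of PS3 GPU, %.1f%% of X360 GPU)\n",
                    spes, total,
                    100.0 * total / ps3_gpu_gflops,
                    100.0 * total / x360_gpu_gflops);
    }
    // 5 SPEs: 125 GFlops (65.1% of PS3 GPU, 52.1% of X360 GPU)
    // 6 SPEs: 150 GFlops (78.1% of PS3 GPU, 62.5% of X360 GPU)
    return 0;
}
```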

Due to the weakness of the GPU vertex unit, developers would use the SPEs to do skinning. The pixel shading unit did not have constants, so one would also have to do shader patching on the SPEs before these programs could be sent to the GPU. All of these things require synchronization between the CPU, SPE and GPU, and interact with workload balancing. In retrospect, I also assume that dynamic branching in the shader was either impossible or prohibitive, which is why everyone did excessive shader permutations. This meant tens of megabytes of shaders. Again, contrast this with the Xbox 360, which supported wave operations 3, and I even used this feature back in the day. Because each component of the PS3 is weak on its own, they all must be employed in concert to compete with less heterogeneous platforms.
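Roughly, the shader-patching idea looks like the hedged sketch below: because the pixel pipeline had no constant registers, per-draw constant values were written straight into a copy of the shader microcode before it was handed to the GPU. The names here (PatchLocation, patch_fragment_shader, the byte-offset table) are hypothetical, not the real toolchain; real engines did this with platform-specific offsets and typically ran it on the SPEs.

```cpp
// Hypothetical illustration of constant patching: copy the shader microcode,
// then overwrite the bytes where each "constant" is embedded.
#include <cstdint>
#include <cstring>
#include <vector>

struct PatchLocation {
    std::uint32_t offset_in_microcode; // byte offset of the embedded constant
    std::uint32_t constant_index;      // which engine-side constant goes there
};

std::vector<std::uint8_t> patch_fragment_shader(
        const std::vector<std::uint8_t>& microcode_template,
        const std::vector<PatchLocation>& patch_table,
        const float (*constants)[4]) {                    // array of xyzw constants
    std::vector<std::uint8_t> patched = microcode_template;  // per-draw copy
    for (const PatchLocation& p : patch_table) {
        std::memcpy(&patched[p.offset_in_microcode],
                    constants[p.constant_index], 4 * sizeof(float));
    }
    return patched;  // this patched copy is what the GPU ends up executing
}
```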

Computer Not Super

While the Cell could behave more like a supercomputer, I saw it mostly used like generic GPU compute. I never saw production code that did anything but dispatch N jobs from the PPE. I never saw direct inter-SPE communication, even though I recall such a thing was possible (mailboxes). This is similar to how inter-workgroup workloads on GPUs are rarer and more difficult.
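For flavour, the shape of that pattern looked roughly like the hedged sketch below, with std::thread standing in for SPE contexts (the names are illustrative, not Sony's actual runtime): slice a buffer into N independent jobs, fan them out, join at the end, and no worker ever talks to another.

```cpp
// "Dispatch N jobs from the PPE", sketched on the host. Each job is pure
// data-in/data-out with no shared state; the only synchronization is the
// coarse join at the end.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct Job {
    const float* input;
    float*       output;
    std::size_t  count;
};

// Stand-in for the per-SPE kernel.
void run_job(Job job) {
    for (std::size_t i = 0; i < job.count; ++i)
        job.output[i] = job.input[i] * 2.0f;
}

void dispatch_jobs(const float* in, float* out, std::size_t n, int workers) {
    std::vector<std::thread> pool;
    const std::size_t chunk = (n + workers - 1) / workers;
    for (int w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        if (begin >= n) break;
        pool.emplace_back(run_job, Job{in + begin, out + begin,
                                       std::min(chunk, n - begin)});
    }
    for (std::thread& t : pool) t.join();  // coarse, PPE-side synchronization
}
```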

The heterogeneous nature was everywhere. Even the PPE was quite different from an SPE. The SPEs had only vector registers; the PPE had floating-point, general-purpose, and vector registers. Is this really bad? No4, but it makes everything more heterogeneous and therefore more complex. Getting maximum performance out of the SPEs meant that you were likely doing async DMAs while also doing compute work. These nuances could be a fun challenge for a top programmer but end up being more of an obstacle to development for game studios.
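The classic shape of that "async DMA while computing" code is double buffering. Below is a minimal sketch: dma_get_async and dma_wait are hypothetical stand-ins for the real MFC intrinsics (implemented here as synchronous host stubs so the snippet compiles), and BLOCK is an assumed tile size that fits in local store.

```cpp
// Double buffering: fetch tile i+1 into one half of local store while the
// compute kernel chews on tile i in the other half.
#include <cstddef>
#include <cstring>

constexpr std::size_t BLOCK = 4096;  // assumed tile size (in floats)

// Hypothetical wrappers standing in for async DMA. On an SPE the get would
// return immediately and dma_wait(tag) would block on the matching tag group.
void dma_get_async(void* local, const void* main_mem, std::size_t bytes, int /*tag*/) {
    std::memcpy(local, main_mem, bytes);  // synchronous stand-in
}
void dma_wait(int /*tag*/) {}

void process(float* block, std::size_t count) {  // stand-in compute kernel
    for (std::size_t i = 0; i < count; ++i) block[i] *= 2.0f;
    // (a real kernel would also DMA the results back to main memory)
}

void consume_stream(const float* main_mem, std::size_t total_blocks) {
    static float buffer[2][BLOCK];  // two tiles resident in "local store"

    dma_get_async(buffer[0], main_mem, BLOCK * sizeof(float), /*tag=*/0);
    for (std::size_t i = 0; i < total_blocks; ++i) {
        const int cur = static_cast<int>(i & 1), nxt = cur ^ 1;
        if (i + 1 < total_blocks)  // kick off the next transfer first...
            dma_get_async(buffer[nxt], main_mem + (i + 1) * BLOCK,
                          BLOCK * sizeof(float), /*tag=*/nxt);
        dma_wait(cur);             // ...then wait only for the tile we need now
        process(buffer[cur], BLOCK);
    }
}
```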

Sharp Edges

The PS3 had 512 MB of total memory, but 256 MB was dedicated to graphics and had only REDACTED MB/s access from the CPU. So, in addition to the 256 MB purely for graphics, you would also have to dedicate system memory for anything that was written to and read back from the GPU. The point here is the inflexibility and heterogeneous nature.
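As a hedged illustration of the placement rule this split forces on you (the enum and function below are made up, not any real PS3 allocator): anything the CPU will ever read back has to live in system memory, because CPU reads from the graphics memory were so slow.

```cpp
// Illustrative only: the decision every buffer allocation ends up making on a
// split-memory machine like this.
enum class MemoryPool { VideoMemory, SystemMemory };

struct BufferDesc {
    bool cpu_reads_back;  // will the CPU ever read the contents?
};

MemoryPool choose_pool(const BufferDesc& desc) {
    // GPU-only data (textures, static geometry) can use the 256 MB of video
    // memory; anything read back by the CPU must stay in system memory,
    // eating into the other 256 MB.
    return desc.cpu_reads_back ? MemoryPool::SystemMemory
                               : MemoryPool::VideoMemory;
}
```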

The PS3 had cool features, but these were never general purpose and could only be exploited with careful attention to detail and sometimes significant engine changes. I recall using the depth bounds test for screen-space shadowing, and one could probably use it for a few other similar GPU techniques (lighting). There were also the alluring double Z writes, a one-off for depth maps if you don't actually use a pixel shader. I don't recall all the edges, but they were sharp, and that meant performance cliffs if one strayed off them. The next section covers the sharpest edge of them all.

The Challenge of Local Memory

Of course, the challenge that everyone knows about with the SPEs is the constraint of memory access to local memory. You got 256 KB, but in reality, once you factored in stack and program, you were probably down to 128 KB. This computational model is far more restrictive than even modern GPU compute, where you can at least access storage buffers directly.

Most code and algorithms cannot be trivially ported to the SPE. C++ virtual functions and methods will not work out of the box. C++ encourages dynamic allocation of objects, but these can point anywhere in main memory. You would need to map pointer addresses from the PPE to the SPE to even attempt running a normal C++ program on the SPE. Also, null (address 0x0) points to the start of local memory, and loading from it is not a segfault.
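To make that concrete, here is a minimal sketch, with made-up types, of what "SPE-friendly" data tends to look like: fixed-size POD records that reference each other by index instead of by pointer, so a whole slab can be DMA'd into local store and processed without ever touching a main-memory address.

```cpp
// Instead of heap-allocated objects with vtables and parent pointers (which
// would all point back into main memory), lay the data out as a flat array of
// PODs with indices. The whole array can be copied to local store as-is.
#include <cstdint>

struct NodeFlat {
    float        local[3];      // local-space position
    float        world[3];      // computed world-space position
    std::int32_t parent_index;  // -1 for root; index into the same array
};

// Runs entirely on one DMA'd slab; assumes parents are stored before children.
void update_world_positions(NodeFlat* nodes, std::int32_t count) {
    for (std::int32_t i = 0; i < count; ++i) {
        NodeFlat& n = nodes[i];
        for (int k = 0; k < 3; ++k) {
            n.world[k] = n.local[k];
            if (n.parent_index >= 0)
                n.world[k] += nodes[n.parent_index].world[k];
        }
    }
}
```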

So, instead of running generic code on the SPE, what developers did was write handcrafted, SPE-friendly code for heavy but parallelizable parts of their engine. With enough talent and investment you could eke out the full compute power of the PS3.5 Of course, this is maybe easier as a first-party developer, as you can at least focus on this exotic hardware and craft your engine and game features around the type of compute available. This is why Naughty Dog famously came so close to showing us the full potential of the console.

Uncharted 3 (PS3), image credit: MobyGames

What could have been

Had the PS3 been what was originally designed, it would have been a much more exotic but much less heterogeneous machine.

The original design was approximately 4 Cell processors running at high frequencies. Perhaps massaging this design would have led to a very homogeneous, high-performance Many-Core architecture. At more than 1 TFlop of general-purpose compute it would have been a beast, and not a gnarly beast but a sleek, smooth, uniform tiger. (Four Cells' worth of SPEs is 32 × 25 GFlops ≈ 800 GFlops even at the shipped clock, so higher frequencies push it past a teraflop.)

One has to actually look at the PS3 as the licked cookie 6 of Many-Core designs. This half-baked, half-hearted attempt became synonymous with the failure of Many-Core. I used to think that the PS3 set back Many-Core for decades; now I wonder if it simply killed it forever.

Refs

1 Some of the titles I have worked on.

2 I will not provide a proof or reference, but the mathematical issue is that the space is not covered uniformly. This low-level composability problem is often seen in instruction selection when writing assembly.

3 The Xbox 360 GPU had wave operations like ifany/ifall that are similar to modern subgroup control-flow operations.

4 The fact that the SPEs only had vector registers was visible to the programmer because loads/stores also had to be vector aligned.

5 PS3 SPE usage: it is clear that some games had higher utilization than others.

6 I am not sure my usage fits Raymond's narrow specifications.

classic.copetti.org: source for images and some technical specifications.