Clock seems to be slightly up, from ~4.1 GHz to ~4.4 GHz.
The basic consensus right now, after a morning of frantic back and forth on Twitter and other platforms (I don't have time to give the details justifying this consensus), is that, as far as *CPU* goes:
- clock is about 8% faster, presumably at essentially flat power (no mean feat)
- IPC gain is a few percent (very much dependent on exactly how you calculate things), nice but not spectacular
BUT
- addition of SME to the ISA...!!!
Many pieces of evidence point to this last (totally unexpected) change.
If we look at this in more detail, and read between the lines, what SEEMS to be the case is that
(a) Apple and ARM negotiated some sort of compromise for SME that allows SME to execute on what is essentially AMX hardware. This means that SME is an outer product engine, and can potentially execute as an accelerator, not associated with a single core, just like AMX.
Practical meaning is that
- the compiler can now target AMX/SME directly (no need to go through Accelerate calls)
- no obvious reason that the AMX/SME hardware will be faster for matrix ops than the previous AMX hardware was (i.e. if you are already making Accelerate calls, I don't expect them to be much faster than the usual gen-to-gen speedups)
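To make the "outer product engine" idea concrete, here's a minimal Python sketch of the accumulate step at the heart of SME (its core instruction, FMOPA, accumulates the outer product of two vectors into a ZA tile). The function names and tile shapes below are illustrative, not Apple's actual hardware parameters:

```python
def fmopa_accumulate(za, x, y):
    """Sketch of SME's outer-product-accumulate: za[i][j] += x[i] * y[j].

    za is a tile of accumulators; x and y are input vectors. One such
    instruction performs len(x) * len(y) multiply-adds, which is why an
    outer-product engine has such high throughput for matrix math.
    """
    for i in range(len(x)):
        for j in range(len(y)):
            za[i][j] += x[i] * y[j]
    return za

def matmul_via_outer_products(a, b):
    """C = A @ B expressed as a sum of outer products over the shared
    dimension: C = sum_k outer(A[:, k], B[k, :]). This is the pattern
    a compiler (or Accelerate) maps onto outer-product hardware."""
    rows, inner, cols = len(a), len(b), len(b[0])
    za = [[0.0] * cols for _ in range(rows)]
    for k in range(inner):
        fmopa_accumulate(za, [a[i][k] for i in range(rows)], b[k])
    return za
```

The point is that one outer-product op replaces an entire tile's worth of scalar multiply-adds, which is where the throughput advantage over plain vector units comes from.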
(b) SME comes with something (which no-one outside Apple/ARM seems to understand yet) called STREAMING SVE mode, which allows for various modifications to SVE. This seems to allow for a separate, different-length set of SVE registers. At the time this was announced (the name and pretty much nothing else), no-one could really understand it. Now, what I think it means is
+ NEON on M4 gets bumped up to SVE. Quite possibly only as 128b SVE, possibly as 256b SVE. Either way, even if it's only 128b, it gets the compiler improvements that allow SVE to be a somewhat better compiler target than NEON (code that's a few percent faster and a few percent smaller)
+ STREAMING SVE allows the developer to execute 512b vector operations on SME/AMX. Uses the AMX register set. High latency, but great throughput. These vector ops already exist on AMX, but only via Accelerate calls.
So for vector code, the compiler can target SSVE (high throughput, high latency) or NEON/SVE. There will undoubtedly be various pragmas and flags to guide this, and a year of confusion till people figure out optimal patterns.
An interesting point is that GB6.3 is compiled with both SME and SVE. So why did it not get a speed boost from SVE?
Possible answers:
- it did, and that's where M4's few-percent IPC improvement comes from? If SVE is 128b, that's the sort of improvement we would expect from using SVE128 rather than NEON.
- Apple SVE is 256b, which raises issues given NEON's 128b registers, and so an app has to somehow mark that it wants SVE [which will flip some CPU setting], otherwise it only gets NEON?
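Part of why the vector length question can be invisible to an app: SVE code is written vector-length-agnostic, so the same binary runs correctly whether the hardware provides 128b or 256b registers. A Python sketch of the whilelt-style predicated loop pattern (lane counts here are just models of the hardware vector length):

```python
def sve_style_sum(data, vl_bytes, elem_bytes=4):
    """Vector-length-agnostic loop, mirroring SVE's whilelt/predication
    pattern: each iteration builds a predicate covering only the
    remaining elements, so the same code handles any vector length
    with no scalar tail loop."""
    lanes = vl_bytes // elem_bytes
    total, i, n = 0, 0, len(data)
    while i < n:
        # predicate: which lanes are active this iteration (whilelt)
        pred = [i + lane < n for lane in range(lanes)]
        total += sum(data[i + lane] for lane in range(lanes) if pred[lane])
        i += lanes
    return total
```

The same function gives the same answer with `vl_bytes=16` (128b) or `vl_bytes=32` (256b); only the iteration count changes, which is exactly the property that would let Apple pick either width without breaking binaries.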
As far as Neural Engine goes, SOMETHING was added that dramatically improves the performance of language nets (as opposed to the earlier primary focus on vision nets).
This COULD be the indirect addressing that's referenced in a few recent ANE patents; or it could be the vector DSP that's referenced in two recent patents.
In both cases, the ultimate performance boost from this addition may be much larger than what we see today; it may be that right now only one or two neural-net layers within CoreML have been updated to use this functionality, and more will follow.
GPU seems unchanged (in the sense that it's M3 GPU, as expected).
Display block obviously is boosted to handle Tandem OLED, with who knows what consequences for lesser screens.
Media block is the great unknown. The references to 8K support [when you compare the M4 iPad Pro to the M2 iPad Pro] make me suspect that encoding speed has been bumped up (possibly also with slightly better quality) for lesser formats like 4K or 2K H.265 (which were plenty fast on the M2, but waiting two minutes rather than four is always nice!)