Clock seems to be slightly up, from ~4.1 GHz to ~4.4 GHz.
The basic consensus right now, after a morning of frantic back and forth on Twitter and other platforms (I don't have time to give the details justifying this consensus), is that, as far as *CPU* goes:
- clock is about 8% faster, presumably at essentially flat power (no mean feat)
- IPC gain is a few percent (very much dependent on exactly how you calculate things), nice but not spectacular
BUT
- addition of SME to the ISA...!!!
Many pieces of evidence point to this last (totally unexpected) change.
If we look at this in more detail, and read between the lines, what SEEMS to be the case is that
(a) Apple and ARM negotiated some sort of compromise for SME that allows SME to execute on what is essentially AMX hardware. This means that SME is an outer product engine, and can potentially execute as an accelerator, not associated with a single core, just like AMX.
Practical meaning is that
- the compiler can now target AMX/SME directly (no need to go through Accelerate calls)
- no obvious reason that the AMX/SME hardware will be faster for matrix ops than the previous AMX hardware was (i.e. if you are already making Accelerate calls, I don't expect them to be much faster than the usual gen-to-gen speedups)
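To make the "outer product engine" idea concrete, here's a minimal Python sketch of the accumulate step at the heart of SME (its core instruction, FMOPA, accumulates the outer product of two vectors into a ZA tile). The function names and tile shapes below are illustrative, not Apple's actual hardware parameters:

```python
def fmopa_accumulate(za, x, y):
    """Sketch of SME's outer-product-accumulate: za[i][j] += x[i] * y[j].

    za is a tile of accumulators; x and y are input vectors. One such
    instruction performs len(x) * len(y) multiply-adds, which is why an
    outer-product engine has such high throughput for matrix math.
    """
    for i in range(len(x)):
        for j in range(len(y)):
            za[i][j] += x[i] * y[j]
    return za

def matmul_via_outer_products(a, b):
    """C = A @ B expressed as a sum of outer products over the shared
    dimension: C = sum_k outer(A[:, k], B[k, :]). This is the pattern
    a compiler (or Accelerate) maps onto outer-product hardware."""
    rows, inner, cols = len(a), len(b), len(b[0])
    za = [[0.0] * cols for _ in range(rows)]
    for k in range(inner):
        fmopa_accumulate(za, [a[i][k] for i in range(rows)], b[k])
    return za
```

The point is that one outer-product op replaces an entire tile's worth of scalar multiply-adds, which is where the throughput advantage over plain vector units comes from.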
(b) SME comes with something (which no-one outside Apple/ARM seems to understand yet) called STREAMING SVE mode, which allows for various modifications to SVE. This seems to allow for a separate, different-length set of SVE registers. At the time this was announced (the name and pretty much nothing else), no-one could really understand it. Now, what I think it means is
+ NEON on M4 gets bumped up to SVE. Quite possibly only as 128b SVE, possibly as 256b SVE. Either way, even if it's only 128b, it gets the compiler improvements that allow SVE to be a somewhat better compiler target than NEON (code that's a few percent faster and a few percent smaller)
+ STREAMING SVE allows the developer to execute 512b vector operations on SME/AMX. Uses the AMX register set. High latency, but great throughput. These vector ops already exist on AMX, but only via Accelerate calls.
So for vector code, the compiler can target SSVE (high throughput, high latency) or NEON/SVE. There will undoubtedly be various pragmas and flags to guide this, and a year of confusion till people figure out optimal patterns.
An interesting point is that GB6.3 is compiled with both SME and SVE. So why did it not get a speed boost from SVE?
Possible answers:
- it did, and that's where M4's few-percent IPC improvement comes from? If SVE is 128b, that's the sort of improvement we would expect from using SVE128 rather than NEON.
- Apple SVE is 256b, which raises issues given NEON's 128b registers, and so an app has to somehow mark that it wants SVE [which will flip some CPU setting], otherwise it only gets NEON?
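Part of why the vector length question can be invisible to an app: SVE code is written vector-length-agnostic, so the same binary runs correctly whether the hardware provides 128b or 256b registers. A Python sketch of the whilelt-style predicated loop pattern (lane counts here are just models of the hardware vector length):

```python
def sve_style_sum(data, vl_bytes, elem_bytes=4):
    """Vector-length-agnostic loop, mirroring SVE's whilelt/predication
    pattern: each iteration builds a predicate covering only the
    remaining elements, so the same code handles any vector length
    with no scalar tail loop."""
    lanes = vl_bytes // elem_bytes
    total, i, n = 0, 0, len(data)
    while i < n:
        # predicate: which lanes are active this iteration (whilelt)
        pred = [i + lane < n for lane in range(lanes)]
        total += sum(data[i + lane] for lane in range(lanes) if pred[lane])
        i += lanes
    return total
```

The same function gives the same answer with `vl_bytes=16` (128b) or `vl_bytes=32` (256b); only the iteration count changes, which is exactly the property that would let Apple pick either width without breaking binaries.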
As far as Neural Engine goes, SOMETHING was added that dramatically improves the performance of language nets (as opposed to the earlier primary focus on vision nets).
This COULD be the indirect addressing that's referenced in a few recent ANE patents; or it could be the vector DSP that's referenced in two recent patents.
In both cases, the ultimate performance boost from this addition may be much larger than what we see today; it may be that right now only one or two neural-net layers within CoreML have been updated to use this functionality, and more will follow.
GPU seems unchanged (in the sense that it's M3 GPU, as expected).
Display block obviously is boosted to handle Tandem OLED, with who knows what consequences for lesser screens.
Media block is the great unknown. The references to 8K support [when you compare the M4 iPad Pro to the M2 iPad Pro] make me suspect that encoding speed has been bumped up (possibly also with slightly better quality) for lesser formats like 4K or 2K H.265 (which were plenty fast on the M2, but waiting two minutes rather than four is always nice!)