M2 vs. M3 Buyer's Guide: How Much Better Really Is M3?

Torty · Mar 22, 2024

Why is the M3' Neural engine so much worse than the A17 pro? Source: Wikipedia

Bildschirmfoto 2024-03-22 um 18.08.29.png

Papanate · Mar 22, 2024

Torty said:
Why is the M3' Neural engine so much worse than the A17 pro? Source: Wikipedia
View attachment 2361771

Look at the bandwidt.

Torty · Mar 22, 2024

Papanate said:
Look at the bandwidt.

51.2 GB/s A17 pro crushes the M3 Max with 409.6 GB/s or what do you mean?

chucker23n1 · Mar 22, 2024

Torty said:
Why is the M3' Neural engine so much worse than the A17 pro? Source: Wikipedia
View attachment 2361771

Probably mostly because it's used heavily on iPhones, but not (yet) on Macs.

name99 · Mar 22, 2024

Torty said:
Why is the M3' Neural engine so much worse than the A17 pro? Source: Wikipedia
View attachment 2361771

This has been propagated all over the internet. I see absolutely zero evidence for an IMPORTANT difference here.
My guess is that there was a screwup by marketing.

Counting "operations" in an NPU is a vague business (like on a GPU, but even more so).
The original A12 ANE consisted of multiple multiply-adders, and you could count the "number of operations" as the number of multiply-adds that could be performed. This appears to be what Apple did (and BTW what they did for 8bit operations. The ANE can run in three modes, FP16, INT8 and a fixed point 8.8 mode. FP16 is the main [only?] case of interest to outsides, but the INT8 and fixed point modes appear to be used by Apple vision tasks like photo recognition, and still present even today.)

OK, that seems fairly simple EXCEPT that the A12 ANE has a bunch of extra hardware that goes beyond just multiply-add.
There is fancy hardware for streaming data into the ANE while transposing it. (We are not counting load/store as "ops").
There is fancy hardware for moving data around, and performing hardware loops, within the ANE. (We are not counting indexing and loop control as ops).
There is a "Post Processor" unit that performs a variety of additional tasks, from simple functions (ReLU look, absolute value) to data reductions (add all the items flowing through the ANE to calculate means, standard deviations, and so on). All this also isn't counted as "Ops".

With a later rev (I think A14) a whole "second half" to the ANE was added, called the "Planar Engine". This basically splits the ANE into two halves, one half that performs calculation-limited tasks (eg convolution, matrix-vector/matrix-matrix multiply, outer products) and one half that performs memory bandwidth limited tasks (pooling, element-by element tensor operations, gathering statistics etc).
You could effectively view this as doubling the performance except it's not clear quite the numbers are, and performance is really limited by whatever bandwidth your SoC offers.

Along with this, the memory front-end has gathered even fancier support, not just for transposition now, but for tensor reshaping, and for indirect lookups (which I *think* is mainly targeted at mapping sequences of words to sequences of semantic vectors).

Point is, if you go by the original definition of "Ops" as something like "multiply-adds performed by the MAC units" you get one number. If you go marketing wild you can get a much larger number adding up anything you can think of.

I can believe that someone in Marketing for the A17 indeed went wild, which is how that number got into the 2023 iPhone event; but that afterwards engineers went to speak to marketing and pointed out "this is stupid, it is demeaning, and it is not the way we at Apple do things", so that for all subsequent marketing (including the M3 marketing) we are back to counting in a way that's comparable (sort of) to the earlier devices.

Which one is "correct"? To ask the question reveals that you have understood NOTHING of what I have said. There is no correct answer. There is only more or less understanding of
- the problem to be solved (what types of operations are required to make an NN run fast) and
- the hardware provided (which keeps adding new functionality that does, indeed, make NN's run faster – but that doesn't add substantially to the count of multiply-adders).

If we look at the GB5 ML scores (probably not great, but best available data) we see
[I screwed these numbers up the first time and used the CPU numbers. I've never used the GB ML browser before.
Below I've added all three numbers - CPU, GPU, NPU since the set of all three is interesting]

A11 422 862 430 (ANE on A11 was not the "real" ANE. So I think this is actually running on CPU.)
A12 550 1150 1330 (probably some ANE use now, and ~matches CPU/GPU perf, but lower power) 5 TOPs
A13 687 1370 1700 (slight freq boost) 6 TOPS
A14 851 1580 2360 (Planar Engine?) also double ANE cores, so now 11 TOPs
A15 910 1870 2840 (Apple number is now 16 TOPs. I think just frequency boost)
A16 1070 2520 3230 (Apple number is now 17 TOPs)
A17 1360 2790 3640 (Depending on who you believe, either 35 TOPs or 18 TOPs)

[A11 ANE was a Lattice Digital design, not what became the Apple ANE, which was evolved from the Vision section of the Camera ISP. It was only used for FaceID and Memoji. It may have been a Plan B effort; FaceID needed some sort of appropriate hardware, and maybe Apple was hoping to have spun off the Vision part of the ISP to a separate ANE block, but it wasn't available in time? The A12 and later are the "real" ANE's based on Apple design from the start, and the ones that run third party code, A11 as far as I know does NOT run 3rd party code.]

[Interesting how the GPU score is uniformly close to about 2x the CPU score. And the ANE score is ~3x the CPU score.
Also I did not show different models, I tried to just use "phone average" but, especially as we get to later models, you can really see the important of memory bandwidth rather than compute as iPad GPU scores are substantially higher than phone GPU scores.]

It's hard to be sure of anything in this space because we don't know how "well compiled" the GB ML benchmark is.
Neural networks start life as a fairly abstract representation in PyTorch or some similar language and then go through a few "compiling" stages.
An important point which almost no-one understands (because this space is so new) is that
- it's frequently possibly to rearrange details in the layers [eg data reordering, or kernel compression] in a way that makes no difference to the end results but has massive performance implications. This is what you are seeing when Apple publishes an article saying they made Transformers much more efficient on ANE, or when people talk about making LLAMA run faster on M3. It's mainly data formats, avoiding tensor reshaping (just because the memory front-end can do it doesn't mean it's free!) and kernel quantization.

- some layers in a net may look trivial to the author, but for one reason or another they cannot be executed on the ANE. Which then messes things up; the compiler may decide to run the whole thing on the GPU instead, or may decide to run the first half on the ANE, the problematic layer on the CPU, then the rest on the ANE.
To complicate issues every year more ("hidden") functionality is added to the ANE, but again people who don't understand what is going on don't track this. So they try some layer on their M1- (or even A12-) class ANE, see that it doesn't run (the ML compiler forces that layer to CPU) and assume it's always going to be that way; whereas each year the new ANE can, in fact, support a wider range of layers.
Like Metal, CoreML gains new functionality each year and, like Metal, much of that new functionality (once you know how to read into it) is actually a way of exposing the hardware functionality of the newest chips. But if you're testing this new functionality of an M1, you're going to think "that sucks" or at least "that's disappointing, it only exploits the GPU".

My guess is the GB numbers reflect an author who has no idea what really to do beyond downloading and compiling some models from HuggingFace. So they show us *something* of what Apple systems (the compiler, the GPU and the ANE) can do, but they're not the perfect ANE benchmark we'd like.
Even so, they do show steady annual improvements, that big improvement often occur along with small jumps in the OPs number (meaning a more substantial amount of none FMA work, like pooling) is being captured by the rest of the ANE, and they show a jump (but not massive) with the A17, my guess is reflecting these on-going improvements to the rest of the ANE, not a doubling in FMAC performance.

And, BTW, ANE is very much like Apple GPU a few years ago. A lot of promise there already, but also wide-open fields, so much that can be improved over the next few years – not because Apple have been dumb so far, but just because it's a whole new way of doing things, and everything takes time...

There's also the whole question of training. GB ML only tests inference. The ANE patents claim it is also optimized for training, but I have never seen any work addressing this. Perhaps right now it's something only Apple can really access, given the constraints of Core ML?
One of my interests as something barely ever considered in the mainstream media is the extent to which LLMs can be tailored to the user. I don't want my writing coach to make me sound like the average human, and I want the AI summarizing articles for me, or answering my questions, to speak to MY concerns, not the concerns of the average human. I suspect there's scope to engage in ongoing tuning of an LLM on a device, based on everything I write, everything I read, etc. If anyone can do this, it would be Apple. All this *may* mean they have an interest in the performance of ongoing on-device training... We'll see something, anyway, at WWDC.

Torty · Mar 22, 2024

name99 said:
This has been propagated all over the internet. I see absolutely zero evidence for an IMPORTANT difference here.
My guess is that there was a screwup by marketing.

... (snip)...

Thank you so much for your time and detailed explanation. I only can follow a small part of it but I know now more than before. 🙏🙏🙏

Allen_Wentz · Mar 22, 2024

McWetty said:
It would be wise for Apple to differentiate their chips based on intended use. I do photo editing and videography so having a chip with better raw performance is important. My M1M Studio is perfect for this. But someone that does coding may benefit more from a different chip arrangement. It’s not just “more better faster” anymore. Chips are becoming more use-case specific (AI, graphics, ray tracing, etc) and I’d like to see that reflected in Apple SOCs.

What _exactly_ do you mean by "...I’d like to see that reflected in Apple SOCs."? We already do observe use-case specific performance differences with M3 so I do not understand your question.

theorist9 · Mar 22, 2024

McWetty said:
It would be wise for Apple to differentiate their chips based on intended use. I do photo editing and videography so having a chip with better raw performance is important. My M1M Studio is perfect for this. But someone that does coding may benefit more from a different chip arrangement. It’s not just “more better faster” anymore. Chips are becoming more use-case specific (AI, graphics, ray tracing, etc) and I’d like to see that reflected in Apple SOCs.

That level of customizability would be nice. For instance, I'm sure there are those who would like Ultra-level amounts of RAM but don't need the Ultra's numerous CPU cores.

Alas, Apple's moving in the opposite direction (in part because of their UMA), where max RAM, #CPU cores, and #GPU cores increase in lock step. For instance, the Intel MP allowed one to independently choose from among several different CPU and GPU options, and offered a wide range of RAM. By contrast, the AS MP has only one CPU option and two GPU options, and far fewer RAM choices.

And unless Apple comes up with separate SoC's for the CPU and GPU, that's probably not going to change.

ric22 · Mar 22, 2024

theorist9 said:
That level of customizability would be nice. For instance, I'm sure there are those who would like Ultra-level amounts of RAM but don't need the Ultra's numerous CPU cores.

Alas, Apple's moving in the opposite direction (in part because of their UMA), where max RAM, #CPU cores, and #GPU cores increase in lock step. For instance, the Intel MP allowed one to independently choose from among several different CPU and GPU options, and offered a wide range of RAM. By contrast, the AS MP has only one CPU option and two GPU options, and far fewer RAM choices.

And unless Apple come up with separate SoC's for the CPU and GPU, that's probably not going to change.

It is a pity. It results in many buyers spending more than they'd otherwise need to, or results in people that just need one of those three looking elsewhere.

Chuckeee · Mar 22, 2024

theorist9 said:
That level of customizability would be nice. For instance, I'm sure there are those who would like Ultra-level amounts of RAM but don't need the Ultra's numerous CPU cores.

Alas, Apple's moving in the opposite direction (in part because of their UMA), where max RAM, #CPU cores, and #GPU cores increase in lock step. For instance, the Intel MP allowed one to independently choose from among several different CPU and GPU options, and offered a wide range of RAM. By contrast, the AS MP has only one CPU option and two GPU options, and far fewer RAM choices.

And unless Apple comes up with separate SoC's for the CPU and GPU, that's probably not going to change.

Although, if and when chiplets get integrated into the M fabrication process there could eventually be Some relief from simultaneous lockstep advancement of max RAM, #CPU cores, and #GPU cores.

name99 · Mar 22, 2024

name99 said:
This has been propagated all over the internet. I see absolutely zero evidence for an IMPORTANT difference here.
My guess is that there was a screwup by marketing.

Counting "operations" in an NPU is a vague business (like on a GPU, but even more so).
The original A12 ANE consisted of multiple multiply-adders, and you could count the "number of operations" as the number of multiply-adds that could be performed. This appears to be what Apple did (and BTW what they did for 8bit operations. The ANE can run in three modes, FP16, INT8 and a fixed point 8.8 mode. FP16 is the main [only?] case of interest to outsides, but the INT8 and fixed point modes appear to be used by Apple vision tasks like photo recognition, and still present even today.)

OK, that seems fairly simple EXCEPT that the A12 ANE has a bunch of extra hardware that goes beyond just multiply-add.
There is fancy hardware for streaming data into the ANE while transposing it. (We are not counting load/store as "ops").
There is fancy hardware for moving data around, and performing hardware loops, within the ANE. (We are not counting indexing and loop control as ops).
There is a "Post Processor" unit that performs a variety of additional tasks, from simple functions (ReLU look, absolute value) to data reductions (add all the items flowing through the ANE to calculate means, standard deviations, and so on). All this also isn't counted as "Ops".

With a later rev (I think A14) a whole "second half" to the ANE was added, called the "Planar Engine". This basically splits the ANE into two halves, one half that performs calculation-limited tasks (eg convolution, matrix-vector/matrix-matrix multiply, outer products) and one half that performs memory bandwidth limited tasks (pooling, element-by element tensor operations, gathering statistics etc).
You could effectively view this as doubling the performance except it's not clear quite the numbers are, and performance is really limited by whatever bandwidth your SoC offers.

Along with this, the memory front-end has gathered even fancier support, not just for transposition now, but for tensor reshaping, and for indirect lookups (which I *think* is mainly targeted at mapping sequences of words to sequences of semantic vectors).

Point is, if you go by the original definition of "Ops" as something like "multiply-adds performed by the MAC units" you get one number. If you go marketing wild you can get a much larger number adding up anything you can think of.

I can believe that someone in Marketing for the A17 indeed went wild, which is how that number got into the 2023 iPhone event; but that afterwards engineers went to speak to marketing and pointed out "this is stupid, it is demeaning, and it is not the way we at Apple do things", so that for all subsequent marketing (including the M3 marketing) we are back to counting in a way that's comparable (sort of) to the earlier devices.

Which one is "correct"? To ask the question reveals that you have understood NOTHING of what I have said. There is no correct answer. There is only more or less understanding of
- the problem to be solved (what types of operations are required to make an NN run fast) and
- the hardware provided (which keeps adding new functionality that does, indeed, make NN's run faster – but that doesn't add substantially to the count of multiply-adders).

If we look at the GB5 ML scores (probably not great, but best available data) we see
[I screwed these numbers up the first time and used the CPU numbers. I've never used the GB ML browser before.
Below I've added all three numbers - CPU, GPU, NPU since the set of all three is interesting]

A11 422 862 430 (ANE on A11 was not the "real" ANE. So I think this is actually running on CPU.)
A12 550 1150 1330 (probably some ANE use now, and ~matches CPU/GPU perf, but lower power) 5 TOPs
A13 687 1370 1700 (slight freq boost) 6 TOPS
A14 851 1580 2360 (Planar Engine?) also double ANE cores, so now 11 TOPs
A15 910 1870 2840 (Apple number is now 16 TOPs. I think just frequency boost)
A16 1070 2520 3230 (Apple number is now 17 TOPs)
A17 1360 2790 3640 (Depending on who you believe, either 35 TOPs or 18 TOPs)

[A11 ANE was a Lattice Digital design, not what became the Apple ANE, which was evolved from the Vision section of the Camera ISP. It was only used for FaceID and Memoji. It may have been a Plan B effort; FaceID needed some sort of appropriate hardware, and maybe Apple was hoping to have spun off the Vision part of the ISP to a separate ANE block, but it wasn't available in time? The A12 and later are the "real" ANE's based on Apple design from the start, and the ones that run third party code, A11 as far as I know does NOT run 3rd party code.]

[Interesting how the GPU score is uniformly close to about 2x the CPU score. And the ANE score is ~3x the CPU score.
Also I did not show different models, I tried to just use "phone average" but, especially as we get to later models, you can really see the important of memory bandwidth rather than compute as iPad GPU scores are substantially higher than phone GPU scores.]

It's hard to be sure of anything in this space because we don't know how "well compiled" the GB ML benchmark is.
Neural networks start life as a fairly abstract representation in PyTorch or some similar language and then go through a few "compiling" stages.
An important point which almost no-one understands (because this space is so new) is that
- it's frequently possibly to rearrange details in the layers [eg data reordering, or kernel compression] in a way that makes no difference to the end results but has massive performance implications. This is what you are seeing when Apple publishes an article saying they made Transformers much more efficient on ANE, or when people talk about making LLAMA run faster on M3. It's mainly data formats, avoiding tensor reshaping (just because the memory front-end can do it doesn't mean it's free!) and kernel quantization.

- some layers in a net may look trivial to the author, but for one reason or another they cannot be executed on the ANE. Which then messes things up; the compiler may decide to run the whole thing on the GPU instead, or may decide to run the first half on the ANE, the problematic layer on the CPU, then the rest on the ANE.
To complicate issues every year more ("hidden") functionality is added to the ANE, but again people who don't understand what is going on don't track this. So they try some layer on their M1- (or even A12-) class ANE, see that it doesn't run (the ML compiler forces that layer to CPU) and assume it's always going to be that way; whereas each year the new ANE can, in fact, support a wider range of layers.
Like Metal, CoreML gains new functionality each year and, like Metal, much of that new functionality (once you know how to read into it) is actually a way of exposing the hardware functionality of the newest chips. But if you're testing this new functionality of an M1, you're going to think "that sucks" or at least "that's disappointing, it only exploits the GPU".

My guess is the GB numbers reflect an author who has no idea what really to do beyond downloading and compiling some models from HuggingFace. So they show us *something* of what Apple systems (the compiler, the GPU and the ANE) can do, but they're not the perfect ANE benchmark we'd like.
Even so, they do show steady annual improvements, that big improvement often occur along with small jumps in the OPs number (meaning a more substantial amount of none FMA work, like pooling) is being captured by the rest of the ANE, and they show a jump (but not massive) with the A17, my guess is reflecting these on-going improvements to the rest of the ANE, not a doubling in FMAC performance.

And, BTW, ANE is very much like Apple GPU a few years ago. A lot of promise there already, but also wide-open fields, so much that can be improved over the next few years – not because Apple have been dumb so far, but just because it's a whole new way of doing things, and everything takes time...

There's also the whole question of training. GB ML only tests inference. The ANE patents claim it is also optimized for training, but I have never seen any work addressing this. Perhaps right now it's something only Apple can really access, given the constraints of Core ML?
One of my interests as something barely ever considered in the mainstream media is the extent to which LLMs can be tailored to the user. I don't want my writing coach to make me sound like the average human, and I want the AI summarizing articles for me, or answering my questions, to speak to MY concerns, not the concerns of the average human. I suspect there's scope to engage in ongoing tuning of an LLM on a device, based on everything I write, everything I read, etc. If anyone can do this, it would be Apple. All this *may* mean they have an interest in the performance of ongoing on-device training... We'll see something, anyway, at WWDC.

Followup:

Look at these values:

Mac15,12 vs iPhone 15 Pro - Geekbench

View those in parallel with (M3 Pro)

MacBook Pro (16-inch, Nov 2023) vs iPhone 15 Pro - Geekbench

and (M3 Max)

MacBook Pro (16-inch, Nov 2023) vs iPhone 15 Pro - Geekbench

Once again it's hard to be sure what's going on, and what code is running where. It doesn't help that each combination I found is using different versions of the OS and it's possible (given how fast ML is moving) that in each new OS build important new optimizations have been added?

Probably the F32 code variants are running on the GPU in each case, and it's no surprise that they scale (though surprisingly weakly) by the GPU size. Some (like Pose Estimation) seem to scale like CPU P-core core count?
So F32 seems to be a mixed bag, probably executed partially on GPU, partially on (AMX?), perhaps even partially on CPU, and with random scaling.

F16 I would assume is mostly on the ANE and the results seem to agree with that, and with a common ANE on all the devices, with just a 10% or 20% boost from better memory bandwidth.

I8 is mostly also, I assume on ANE. I didn't think anyone outside Apple Vision really cared about I8, but I guess I was wrong – seems lots of benchmarks at least use it.
Another interesting point is that the I8 and F16 numbers are so similar. I don't know if this is because I8 is actually 8.8 fixed point (in which case we expect each operation to be executed twice, as upper and lower halves), or if there are actually as many FP16 and I8 MAC units. My assumption was that there are 256 MAC units in each core, and one MAC unit holds two int8 arithmetic engines and one fp16 arithmetic engine, but it looks like this assumption might be mistaken. Sorry I'm so vague – we all know very little about this and are trying to fit everything together! One reason I'm gathering and writing all these numbers is to check and (hopefully in the right direction!) update my assumptions and guesses.

Now consider a different direction:
The first set of benchmarks (all the way to Image Super-Resolution) are essentially vision/image tasks and (unsurprisingly for hardware that began life as a camera-attached vision unit) ANE does very well.
The last two (times three) tasks are text, and have very different performance. But they also have values that are ALL OVER THE PLACE for every possible combination!!!
The one thing that's pretty constant is that text at F32 takes about the same amount of time on every chip. Which suggests (?) that it's routed to P-AMX, and for some reason (good reason? just no optimization time yet?) only one of the Max's two P-AMX units is used.
Then what? Text Classification F16 and I8 text go to the GPU and scale kinda like what we'd expect from the GPU?
Machine Translation F16 and I8 also go to the GPU, but GPU scaling for these tasks is really really bad? Or they have to keep moving layers between GPU and AMX, and the synchronization costs kill you?

Two things that do seem clear:
- there's zero evidence for a super ANE in the A17 that can beat the M3
- Apple's current text story does not seem very directed to the ANE. Which is somewhat scary insofar as text is the future, though AMX and GPU can pick up the slack temporarily.

This confirms my analysis from the patents. The patents show that (even as of the most recent patents) the Apple design can, among other things, do a really good job of multiplying a SIGNAL matrix against a KERNEL vector. This is, of course, exactly what you want for convolution and most vision tasks (and the rest, like pooling, are also easily handled).
Unfortunately what the design as-is cannot handle well is multiplying a SIGNAL vector against a KERNEL matrix, as frequently required by language.

I don't believe this is an intrinsic flaw; in fact I believe it can be fixed fairly easily (store kernel in the common buffer pool rather than in separate storage, and duplicate the kernel extract block along both pathways – think of the two pathways not as "signal" and "kernel" but as "matrix" and "vector"). In a perfect world, we'd already see these (along with a set of other "fairly easy" tweaks that can be added to substantially improve language performance, in the A18/M4 generation. Six months or so till we see how that turns out

name99 · Mar 22, 2024

name99 said:
Followup:

Look at these values:

Mac15,12 vs iPhone 15 Pro - Geekbench

View those in parallel with (M3 Pro)

MacBook Pro (16-inch, Nov 2023) vs iPhone 15 Pro - Geekbench

and (M3 Max)

MacBook Pro (16-inch, Nov 2023) vs iPhone 15 Pro - Geekbench

Once again it's hard to be sure what's going on, and what code is running where. It doesn't help that each combination I found is using different versions of the OS and it's possible (given how fast ML is moving) that in each new OS build important new optimizations have been added?

Probably the F32 code variants are running on the GPU in each case, and it's no surprise that they scale (though surprisingly weakly) by the GPU size. Some (like Pose Estimation) seem to scale like CPU P-core core count?
So F32 seems to be a mixed bag, probably executed partially on GPU, partially on (AMX?), perhaps even partially on CPU, and with random scaling.

F16 I would assume is mostly on the ANE and the results seem to agree with that, and with a common ANE on all the devices, with just a 10% or 20% boost from better memory bandwidth.

I8 is mostly also, I assume on ANE. I didn't think anyone outside Apple Vision really cared about I8, but I guess I was wrong – seems lots of benchmarks at least use it.
Another interesting point is that the I8 and F16 numbers are so similar. I don't know if this is because I8 is actually 8.8 fixed point (in which case we expect each operation to be executed twice, as upper and lower halves), or if there are actually as many FP16 and I8 MAC units. My assumption was that there are 256 MAC units in each core, and one MAC unit holds two int8 arithmetic engines and one fp16 arithmetic engine, but it looks like this assumption might be mistaken. Sorry I'm so vague – we all know very little about this and are trying to fit everything together! One reason I'm gathering and writing all these numbers is to check and (hopefully in the right direction!) update my assumptions and guesses.

Now consider a different direction:
The first set of benchmarks (all the way to Image Super-Resolution) are essentially vision/image tasks and (unsurprisingly for hardware that began life as a camera-attached vision unit) ANE does very well.
The last two (times three) tasks are text, and have very different performance. But they also have values that are ALL OVER THE PLACE for every possible combination!!!
The one thing that's pretty constant is that text at F32 takes about the same amount of time on every chip. Which suggests (?) that it's routed to P-AMX, and for some reason (good reason? just no optimization time yet?) only one of the Max's two P-AMX units is used.
Then what? Text Classification F16 and I8 text go to the GPU and scale kinda like what we'd expect from the GPU?
Machine Translation F16 and I8 also go to the GPU, but GPU scaling for these tasks is really really bad? Or they have to keep moving layers between GPU and AMX, and the synchronization costs kill you?

Two things that do seem clear:
- there's zero evidence for a super ANE in the A17 that can beat the M3
- Apple's current text story does not seem very directed to the ANE. Which is somewhat scary insofar as text is the future, though AMX and GPU can pick up the slack temporarily.

This confirms my analysis from the patents. The patents show that (even as of the most recent patents) the Apple design can, among other things, do a really good job of multiplying a SIGNAL matrix against a KERNEL vector. This is, of course, exactly what you want for convolution and most vision tasks (and the rest, like pooling, are also easily handled).
Unfortunately what the design as-is cannot handle well is multiplying a SIGNAL vector against a KERNEL matrix, as frequently required by language.

I don't believe this is an intrinsic flaw; in fact I believe it can be fixed fairly easily (store kernel in the common buffer pool rather than in separate storage, and duplicate the kernel extract block along both pathways – think of the two pathways not as "signal" and "kernel" but as "matrix" and "vector"). In a perfect world, we'd already see these (along with a set of other "fairly easy" tweaks that can be added to substantially improve language performance, in the A18/M4 generation. Six months or so till we see how that turns out

Okay, once again I misunderstood how the GB ML browser frameworks.
God, this web UI was truly designed by a moron.
[If you think that's too harsh, look at the way the search results page of search tells you the Inference Framework that was used, but there is no way to either filter to force the Framework, nor is which Framework was used displayed on the page when you click on it or compare two results...]

Give me a few minutes to recalibrate like with like, now that I figured out how to force ANE results. Let's see if anything changes.

name99 · Mar 22, 2024

name99 said:
Okay, once again I misunderstood how the GB ML browser frameworks.
God, this web UI was truly designed by a moron.
[If you think that's too harsh, look at the way the search results page of search tells you the Inference Framework that was used, but there is no way to either filter to force the Framework, nor is which Framework was used displayed on the page when you click on it or compare two results...]

Give me a few minutes to recalibrate like with like, now that I figured out how to force ANE results. Let's see if anything changes.

OK, now that we have (as far as possible) forced thing to the ANE what do we see?
M3

Mac15,12 vs iPhone 15 Pro - Geekbench

M3 Pro

MacBook Pro (14-inch, Nov 2023) vs iPhone 15 Pro - Geekbench

M3 Max

MacBook Pro (14-inch, Nov 2023) vs iPhone 15 Pro - Geekbench

We still expect that all F32 results route other than the ANE.
F32 results that look much better on M3 than 17 are probably going through the GPU (2x the GPU cores).

I'm guessing that Text Classification and Machine Translation F32 are going through AMX (on the Pro seeing some advantage from bandwidth and larger L2, on the Max maybe seeing two AMX units being used?)

Machine Translation F16 is basically 2x for M3, and Pro, and Max. Is that purely bandwidth? It's running on the ANE (same hardware in each case) but on the phone it is limited by bandwidth, on M3 and higher it is limited by ANE computation?

What about Text Classification F16? Seems like the same OS version in all three cases, but M3 is slightly faster, Pro and Max are massively faster. Cache effect, like the M3 SLC can not fully hold the working set, but still holds more than the A17, and then the Pro and Max can easily hold the working set?

Still so much so unclear :-(

chucker23n1 · Mar 22, 2024

theorist9 said:
That level of customizability would be nice. For instance, I'm sure there are those who would like Ultra-level amounts of RAM but don't need the Ultra's numerous CPU cores.

Yeah, as a software dev, I wouldn't mind something Max-like for the additional RAM but even the Pro's GPU cores are overkill. The Max's certainly would be.

But, that increases the complexity of their SoC line-up.

I don't know about the bean counter math on it. What makes more revenues: requiring customers who really need more than 36 GiB RAM to spend another $700 to get the Max with 48? Or not requiring that, offering an in-between option, and having more people in total opt for that (but also having the additional R&D cost of designing that chip in the first place)?

I'm sure they've done models on that, but I wouldn't be surprised if the answer isn't more complicated/cynical than "we'd rather have a simpler set of SoC choices".

theorist9 · Mar 22, 2024

chucker23n1 said:
Yeah, as a software dev, I wouldn't mind something Max-like for the additional RAM but even the Pro's GPU cores are overkill. The Max's certainly would be.

But, that increases the complexity of their SoC line-up.

I don't know about the bean counter math on it. What makes more revenues: requiring customers who really need more than 36 GiB RAM to spend another $700 to get the Max with 48? Or not requiring that, offering an in-between option, and having more people in total opt for that (but also having the additional R&D cost of designing that chip in the first place)?

I'm sure they've done models on that, but I wouldn't be surprised if the answer isn't more complicated/cynical than "we'd rather have a simpler set of SoC choices".

Yeah, I didn't mention that explicitly in my post, but I also think Apple's own preference is for a simple lineup with limited options.

ric22 · Apr 30, 2024

name99 said:
Point is, if you go by the original definition of "Ops" as something like "multiply-adds performed by the MAC units" you get one number. If you go marketing wild you can get a much larger number adding up anything you can think of.

I can believe that someone in Marketing for the A17 indeed went wild, which is how that number got into the 2023 iPhone event; but that afterwards engineers went to speak to marketing and pointed out "this is stupid, it is demeaning, and it is not the way we at Apple do things", so that for all subsequent marketing (including the M3 marketing) we are back to counting in a way that's comparable (sort of) to the earlier devices.

@Torty

Torty · Apr 30, 2024

ric22 said:
@Torty

I see and I followed the thread but when apple says it's "2x as fast" then it's more than about only numbers. It's about performance.

ric22 · Apr 30, 2024

Torty said:
I see and I followed the thread but when apple says it's "2x as fast" then it's more than about only numbers. It's about performance.

But what if that was just the advertisers getting carried away after calculating it differently?

Edit: Plus, where is the A series to M series comparison you mentioned in the other thread?

Search

Search

M2 vs. M3 Buyer's Guide: How Much Better Really Is M3?

Torty

macrumors 65816

Papanate

macrumors 6502

Torty

macrumors 65816

chucker23n1

macrumors G3

name99

macrumors 68020

Torty

macrumors 65816

Allen_Wentz

macrumors 68030

theorist9

macrumors 68040

ric22

macrumors 68020

Chuckeee

macrumors 68000

name99

macrumors 68020

name99

macrumors 68020

name99

macrumors 68020

chucker23n1

macrumors G3

theorist9

macrumors 68040

ric22

macrumors 68020

Torty

macrumors 65816

ric22

macrumors 68020

Our Staff