[CPU only] Apple M1/2(Max/Ultra) (TSMC 5nm) vs AMD Zen 4 (TSMC 5nm) - Technical Analysis

pshufd · Nov 24, 2023

leman said:
You can do this on a stationary computer. But anything with a display and a battery is a problem…

Yup.

But you may get thermal throttling with a laptop too, and I'd rather get numbers that show what the system is capable of.

MapleBeercules · Nov 24, 2023

Comparing M3 to anything right now is a bad comparison...
AMD 7000 was built on a known processor architecture; the M3 was built on a new architecture with terrible results both in yields and in quality. When TSMC switches over to N3E and Apple makes a chip on that node, it will smoke anything AMD has on their roadmap for years to come.

N3B, which all M3 processors are based on, is pure crap. In fact, Apple is the only retailer who accepted any N3B products; every other company turned down N3B because it was pure crap.

MRMSFC · Nov 24, 2023

MapleBeercules said:
Comparing M3 to anything right now is a bad comparison...
AMD 7000 was built on a known processor architecture; the M3 was built on a new architecture with terrible results both in yields and in quality. When TSMC switches over to N3E and Apple makes a chip on that node, it will smoke anything AMD has on their roadmap for years to come.

N3B, which all M3 processors are based on, is pure crap. In fact, Apple is the only retailer who accepted any N3B products; every other company turned down N3B because it was pure crap.

As a resident Apple Silicon fanboy:

It’s perfectly valid to compare two current competing products with each other. What you’ve suggested is like what another user suggested but in reverse (that comparing Intel v. Apple Silicon isn’t fair because of process node).

We should hold every product to the same standards.

name99 · Nov 24, 2023

Xiao_Xi said:
Statisticians use the standard deviation, not the percentage, to establish whether two points are significantly different or not.

Seriously dude?
OK, let me be very clear. When I say noise I mean there is nothing TECHNICALLY interesting in such small differences.
If you find such differences fascinating for whatever reason, go for it.

But don't be surprised when other people are simply UNINTERESTED in your trumpeting such numbers. They're not interesting for clarifying tech differences between designs. They're not interesting for deciding to buy a new machine.

Xiao_Xi · Dec 5, 2023

HotChips just uploaded the presentations from the last conference.

Advance Program

A Symposium on High Performance Chips

www.hotchips.org

AMD made two presentations that may be interesting for this thread.
- AMD Next Generation “Zen 4” Core and 4th Gen AMD EPYC™ 9004 Server CPU
- AMD Ryzen 7040 Series Mobile Processor

Xiao_Xi · Feb 29, 2024

Can anyone confirm and explain this comment?

The chips you're used to have had decades of optimization to their automatic memory prefetching. Apple's M1 etc. do even better (they can speculatively prefetch through a pointer, which can make lots of data structures a ton faster). Also, for really high performance on x86, prefetching can often be a ~30% boost.

The chips you're used to have had decades of optimization to their automatic mem... | Hacker News

news.ycombinator.com

leman · Feb 29, 2024

Xiao_Xi said:
Can anyone confirm and explain this comment?

The chips you're used to have had decades of optimization to their automatic mem... | Hacker News

news.ycombinator.com

For the fastest performance, you want your data to be in cache. Caches usually work in small blocks (e.g. 64 bytes). This means that if you are processing a sequential list of data elements, only the first element of each 64-byte block will actually result in a DRAM access: the entire cache block is loaded, and the subsequent n elements can be read from the fast cache.

Even better, if you know that you will be processing a bunch of such elements, it makes sense for the CPU to load the relevant blocks into the cache before you get to the relevant item, which reduces waiting times. A while ago we had dedicated prefetch hints for this (e.g. before doing a large memory read you could tell the CPU that you were about to do it, prompting it to start loading the data from DRAM into the cache). Nowadays this is done with automatic prefetchers that try to learn your access pattern and prefetch data accordingly. Detecting linear access is simple (e.g. if the CPU sees that you have accessed N consecutive locations in memory, it can assume you are doing array processing and start fetching ahead).

Apple goes one step further and also prefetches indirect accesses: if you are processing an array of pointers, it will detect this and start loading the data at the subsequent pointer addresses.
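
To put rough numbers on the idea, here is a toy Python model. It is only a sketch and says nothing about how any real prefetcher (Apple's or anyone else's) is actually built. The assumptions: 64-byte lines, a cache that never evicts, and two made-up helpers, `demand_misses` (linear access with a simple "fetch the next few lines" prefetcher) and `indirect_misses` (an array of pointers where the hardware peeks at the next few pointers and pulls in their targets early).

```python
LINE_SIZE = 64  # bytes per cache line (the example size used above)


def demand_misses(addresses, prefetch_ahead=0):
    """Count accesses whose line is absent from a toy cache that never evicts.

    prefetch_ahead=k models a hypothetical linear prefetcher: whenever the
    access stream steps from one line to the very next one, the following
    k lines are pulled into the cache before they are asked for.
    """
    cached = set()
    misses = 0
    last_line = None
    for addr in addresses:
        line = addr // LINE_SIZE
        if line not in cached:
            misses += 1
            cached.add(line)
        if line != last_line:
            if prefetch_ahead and last_line is not None and line == last_line + 1:
                cached.update(range(line + 1, line + 1 + prefetch_ahead))
            last_line = line
    return misses


def indirect_misses(targets, lookahead=0):
    """Count misses on the pointed-to data when walking an array of pointers.

    targets[i] is the address stored in slot i of the pointer array.
    lookahead=k models a hypothetical indirect prefetcher that reads the
    next k pointers and loads the lines they point at ahead of time.
    """
    cached = set()
    misses = 0
    for i, target in enumerate(targets):
        line = target // LINE_SIZE
        if line not in cached:
            misses += 1
            cached.add(line)
        for upcoming in targets[i + 1 : i + 1 + lookahead]:
            cached.add(upcoming // LINE_SIZE)
    return misses


# 1024 sequential 8-byte elements span 128 cache lines.
sequential = range(0, 1024 * 8, 8)
# 100 pointers to targets scattered over distinct 4 KiB pages (made-up layout).
scattered = [((i * 919) % 1000) * 4096 for i in range(100)]

linear_plain = demand_misses(sequential)
linear_prefetched = demand_misses(sequential, prefetch_ahead=4)
indirect_plain = indirect_misses(scattered)
indirect_prefetched = indirect_misses(scattered, lookahead=4)
```

In this model the linear walk takes 128 demand misses without prefetching (one per line, not one per element) and 2 with it, while the pointer walk drops from 100 misses to 1 once the pretend indirect prefetcher is switched on. Real hardware has finite caches, latency, and bandwidth limits, so treat the numbers as an illustration of the mechanism only.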

Sydde · Feb 29, 2024

leman said:
Nowadays this is done with automatic prefetchers that try to learn your access pattern and prefetch data accordingly.

There is in fact still an instruction for explicit memory prefetch, because maybe it is sometimes still needed. I am not seeing any flush/invalidate instructions (though with all the layers of caching, that would be somewhat fraught); perhaps those are effected through MSR?

name99 · Mar 1, 2024

Sydde said:
There is in fact still an instruction for explicit memory prefetch, because maybe it is sometimes still needed. I am not seeing any flush/invalidate instructions (though with all the layers of caching, that would be somewhat fraught); perhaps those are effected through MSR?

The DC instruction has modifiers that will perform a wide range of cache maintenance operations.

Documentation – Arm Developer

developer.arm.com

Prefetching on Apple chips is in fact extremely sophisticated. I'd be surprised to see a real-world example of a data pattern that's both predictable enough for SW prefetch to be worthwhile, but isn't caught by one of the many Apple hardware prefetchers.

Sydde · Mar 9, 2024

name99 said:
I'd be surprised to see a real-world example of a data pattern that's both predictable enough for SW prefetch to be worthwhile, but isn't caught by one of the many Apple hardware prefetchers.

Well, look at DC ZVA: the program tells the processor, "I am going to fill this whole line with stuff, so zero it out and don't bother to load it." That is just excellent.

name99 · Mar 10, 2024

Sydde said:
Well, look at DC ZVA: the program tells the processor, "I am going to fill this whole line with stuff, so zero it out and don't bother to load it." That is just excellent.

If you are calling DC ZVA a *prefetch* instruction then I'm out of this conversation.
You're obviously more interested in "winning" debate games by playing stupid word tricks than in understanding technology.

Sydde · Mar 10, 2024

name99 said:
If you are calling DC ZVA a *prefetch* instruction then I'm out of this conversation.
You're obviously more interested in "winning" debate games by playing stupid word tricks than in understanding technology.

No, it is not a prefetch, it is a do-not-fetch, because the program only wants to write. It saves the fetch cycle that would normally happen when a program starts writing stuff. Of course, Apple Silicon might well have that in its memory optimization logic, so that a program would not need to issue the instruction at all. In fact, I would not be at all surprised if the other designs, including x86, have it as well.
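
A toy way to see the saving, as a Python sketch under loud assumptions: a write-allocate cache that fetches a line on the first store to it, 64-byte lines, and a zeroing block the same size as a line (on real Arm cores the block size is reported by DCZID_EL0, so that equality is my assumption). The helper `line_fetches_for_fill` is made up for illustration and is not a model of any particular chip.

```python
LINE_SIZE = 64  # assumed cache line size, also assumed to be the zeroed block size


def line_fetches_for_fill(start, n_bytes, store_size=8, zero_first=False):
    """Count lines read from memory while a buffer is overwritten front to back.

    Toy write-allocate cache: a store to an absent line fetches it first.
    With zero_first=True, a hypothetical zero-line operation (the role DC ZVA
    plays) claims every line that lies entirely inside the buffer without
    reading it. Partially covered lines at the edges are still fetched.
    """
    cached = set()
    fetches = 0
    end = start + n_bytes
    if zero_first:
        first_full = -(-start // LINE_SIZE)  # first line fully inside the buffer
        past_last_full = end // LINE_SIZE    # one past the last fully covered line
        cached.update(range(first_full, past_last_full))
    for addr in range(start, end, store_size):
        line = addr // LINE_SIZE
        if line not in cached:
            fetches += 1
            cached.add(line)
    return fetches


aligned_plain = line_fetches_for_fill(0, 4096)
aligned_zeroed = line_fetches_for_fill(0, 4096, zero_first=True)
offset_plain = line_fetches_for_fill(32, 4096)
offset_zeroed = line_fetches_for_fill(32, 4096, zero_first=True)
```

For a line-aligned 4 KiB buffer the model goes from 64 fetched lines to 0. For a buffer that starts 32 bytes into a line it goes from 65 to 2, because the two partially covered lines at the ends cannot simply be zeroed and still have to be read.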
