ARM based MacBook Pro discussion [merged]

Unregistered 4U · Mar 28, 2021

Macbookprodude said:
I liked Apple of the 2000s better.

Of course, some would say Apple of the 1980’s was better.

leman · Mar 28, 2021

Macbookprodude said:
Don’t get me wrong - I love RISC whether PowerPC(I run a few of those for file serving) and ARM(13 inch is bad on my eyes). I do hate Intel though.. but I want to get more proof that M1 is as fast as they say it is. I am a pro-RISC user, not CISC.

CPU performance is not religion. It’s tangible and measurable, and M1 has been analyzed in depth. I have no idea what kind of proof you are looking for, but there should be sufficient information out there to help you form an opinion.

Also, what is with this RISC/CISC discourse? These labels have very little relevance to modern CPUs. ARM v8 is in some ways more CISC than x86, and both modern x86 and modern ARM CPUs are essentially RISC machines under the hood. We should just retire these notions as utterly pointless and look at actually relevant architectural properties instead.

Macbookprodude · Mar 28, 2021

I guess I am a traditionalist from the old school of cpu architectures. I was in college when the PPC G3/MAINLY G4 era came out.

09872738 · Mar 28, 2021

leman said:
x86 .... essentially RISC machines under the hood.

Sorry, that is incorrect. x86 is in no way RISC-y; that‘s just a marketing plot by Intel making that up.@cmaier has explained that in other threads to quite some extent

thekev · Mar 28, 2021

09872738 said:
Sorry, that is incorrect. x86 is in no way RISC-y; that‘s just a marketing plot by Intel making that up.@cmaier has explained that in other threads to quite some extent

Both of them commonly implemented by decoding more complicated instructions into micro-op sequences. Both implement fused operations like FMA. While it's unrelated to the exact taxonomy, both have significant segmentation. For example, ARM's ISA, while considered a RISC type, has the split between Neon and Helium, due to the range of hardware it covers. Intel has a whole history of introducing weird opcodes.

thekev · Mar 28, 2021

09872738 said:
Sorry, that is incorrect. x86 is in no way RISC-y; that‘s just a marketing plot by Intel making that up.@cmaier has explained that in other threads to quite some extent

Didn't he just claim ARM and x86 are nothing alike? ARM itself isn't that similar to a classic RISC architecture. It seems pretty strained to call instruction sets with FMA and saturating arithmetic instructions "reduced".

09872738 · Mar 28, 2021

thekev said:
Didn't he just claim ARM and x86 are nothing alike? ARM itself isn't that similar to a classic RISC architecture. It seems pretty strained to call instruction sets with FMA and saturating arithmetic instructions "reduced".

cmaier said:
The microops are not the same as RISC, at least because they are not independent of each other. There are also complications regarding the register file, and what happens when you run out of scratch registers (and steps you take to avoid scratch registers). Pretty much every internal block in each core has to be aware of aspects of the original CISC instruction to handle context switches, incorrectly-predicted branches, load/store blocking, etc. Having designed the scheduling unit for one of these bad boys, it’s really quite a pain in the neck, takes a lot of circuitry, adds to cycle time (slows the chip down), and results in wires running all over the place to send these extra signals from place to place.

Having also designed true RISC CPUs (sparc, mips, PowerPC), x86-64 cores are nothing at all like those.

He has adressed this at multiple occasions. See above a quote I‘d consider relevant.

From what I understand there is no agreed upon definition of what RISC really is or what exactly a chip design must look like to qualify as RISC. That said, x86 does not seem to contain any feature considered required to qualify as RISC (see @cmaier various posts regarding microps, instruction decoding, instruction length, memory access. Plz bear with me if the latter isn‘t 100% technically accurate, I pulled this from the voids of my memory)

Maximara · Mar 28, 2021

Macbookprodude said:
So, your statement above just proved M1 DOES support running windows 10. As of right now, anything Apple says is empty air until I see it for myself. I hate Intel also.. I have a strong love for PowerPC as that is what made the Mac a real mac as far as i am concerned.. BUT M1 does seem to follow that same tradition. WE WILL SEE, I will wait.. M1 selections now are not in my best interest. 13 inch screen gives me headaches and on my eyes.. I have a 2015 MacBook Pro dual graphics - I may trade that up for a 15 inch M1 or M2.. to me, RISC IS THE BEST.

You have to remember why the PowerPC got dropped; much the same reason Intel got dropped - the company involved could not put out CPUs of the speed at a certain power or when originally promised. It didn't help that the whole consortium fell apart leaving Apple out in the wilderness feeling like it just got conned into looking for queen snakes.

Maximara · Mar 28, 2021

09872738 said:
He has adressed this at multiple occasions. See above aquote I‘d consider relevant.

From what I understand there is no agreed upon definition of what RISC really or what exactly a chip design must look like to qualify as RISC. That said, x86 does not seem to contain any feature considered required to qualify as RISC (see @cmaier various posts regarding microps, instruction decoding, registers, memory fetches. Plz bear with me if the latter isn‘t 100% technically accurate, I pulled this from the voids of my memory)

Stanford has a piece on RISC vs CISC In a thumb nail "The CISC approach attempts to minimize the number of instructions per program, sacrificing the number of cycles per instruction. RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per program."

Near the end we get this: "Today, the Intel x86 is arguable the only chip which retains CISC architecture."

Macbookprodude · Mar 28, 2021

I understand.. also I am not jumping on M1 yet - 13 inch screen of MB Pro is not big enough for me. Eye strain. I have a 2015 MB Pro which I may trade up towards an M1. To me, M1 feels like the PowerPC days even though Apple’s version of ARM is their own. Kind of like think different days.

its sad Motorola and IBM let Apple down. I am an owner of many Macs - one of my favorite ones is the G4 Titanium PB - wrote a couple of new apps for it to help it along plus the internet is good on it. I have a Mac Pro for video editing. 2015 MacBook Pro dual graphics, 2012 MacBook Pro, PB Pismo G4 for my OS 9 needs, and a Power Mac G5 Quad.

I may get rid of the 2012 and 2015 for an M1 15 inch MacBook Pro, but so far that is in the future.

Macbookprodude · Mar 28, 2021

Maximara said:
Stanford has a piece on RISC vs CISC In a thumb nail "The CISC approach attempts to minimize the number of instructions per program, sacrificing the number of cycles per instruction. RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per program."

Near the end we get this: "Today, the Intel x86 is arguable the only chip which retains CISC architecture."

Ok, but where does this leave RISC ? M1 and PowerPC are RISC. I also like how this website was clean and not bloated with ads or YouTube video junk. I just loaded that page on my PB G4 and it flew scrolling to the bottom.

Maximara · Mar 28, 2021

Macbookprodude said:
Ok, but where does this leave RISC ? M1 and PowerPC are RISC.

Yes but there is a key difference. The 68x00 was used by few outside of Apple and the transition to PowerPC was not exactly smooth. More over trying to run x86 code on it was not the most pleasant thing. The M1 by contrast runs x86 code reasonably fast and runs ARM code better than what MS was using.

The reality is the majority of the world is ARM (RISC) with PCs being one of the last bastions of CISC. Intel don't seem to know what it wants to do. It puts out a bunch of clueless commercials slapping Apple around and follows up with 'we would like to work with Apple'. Say what?!

It is like Intel doesn't realize it not only burnt the bridge to Apple but poured gasoline on the remains and set fire to that. Then they realized 'uh perhaps we still needed that bridge'.

Macbookprodude · Mar 28, 2021

So RISC did win after all

cmaier · Mar 28, 2021

thekev said:
Didn't he just claim ARM and x86 are nothing alike? ARM itself isn't that similar to a classic RISC architecture. It seems pretty strained to call instruction sets with FMA and saturating arithmetic instructions "reduced".

ARM is clearly RISC. Memory accesses limited to essentially LDR, LDM, STR, STM, SWP, and PLD instructions, large register count, fixed instruction lengths (within a given mode), no instructions require microcoding, etc. All classic hallmarks of RISC. Like every RISC architecture it has its own quirks (e.g. conditional instructions), but it is fundamentally similar to MIPS, PowerPC, and SPARC, which are all CPUs I have designed. And it is *very* different than x86 and x86-64 (which are also CPUs I have designed).

Ask anyone who actually designs CPUs, and they will tell you that x86 is clearly CISC, pretty much everything else now is RISC, and the differences are easily visible in the complexity of the designs.

Macbookprodude · Mar 28, 2021

I knew this even 20 years ago.

thekev · Mar 28, 2021

cmaier said:
ARM is clearly RISC. Memory accesses limited to essentially LDR, LDM, STR, STM, SWP, and PLD instructions, large register count, fixed instruction lengths (within a given mode), no instructions require microcoding, etc.

The memory access is the main thing that seemingly remains unchanged over the years. ARM uses only standalone loads and provides a large number of register names, whereas x86 provides a number of instructions which accept a pointer as their final argument.

The disparity in register names (as opposed to available register space) isn't that large though. Neon uses 32 simd register names. Intel uses 16 for the most part. AVX512 exposes 32.

cmaier said:
no instructions require microcoding, etc.

I would be surprised if things like VSQRT were implemented without microcoding.

Documentation – Arm Developer

developer.arm.com

cmaier · Mar 28, 2021

thekev said:
The memory access is the main thing that seemingly remains unchanged over the years. ARM uses only standalone loads and provides a large number of register names, whereas x86 provides a number of instructions which accept a pointer as their final argument.

The disparity in register names (as opposed to available register space) isn't that large though. Neon uses 32 simd register names. Intel uses 16 for the most part. AVX512 exposes 32.

I would be surprised if things like VSQRT were implemented without microcoding.

Documentation – Arm Developer

developer.arm.com

Why? I implemented floating point square root for the follow-up to the PowerPC x704, and we certainly didn’t have any microcode. The load/store unit sends the instruction to the ALU where it goes to the sqrt unit (which is a lookup table and some newton-Raphson magic, if I remember correctly - that was in 1996 or 1997 so my memory is vague), and that ALU takes however many cycles as necessary before setting the “I’m done” signal which tells the retirement circuitry that the contents of the bypass register are valid.

There’s a weird idea going around (i think the stanford link above suggested it too) that “complicated” instructions require microcode. Multiplication was given as an example. Multiplication and division can take many clock cycles, but there’s no microcode used on any RISC processor I’ve ever seen. There’s an ALU, you tell it you want to do div or mul, and it takes multiple cycles, and signals when it is done. If the instruction requires passing data from the output of one part of the ALU into an input of another part of the ALU, that’s handled by sequencing logic within the ALU. (That’s rare. I seem to recall I did that in an integer divider, once, where I had to feed something from one circuit into the input of the integer multiplier).

leman · Mar 29, 2021

cmaier said:
ARM is clearly RISC. Memory accesses limited to essentially LDR, LDM, STR, STM, SWP, and PLD instructions, large register count, fixed instruction lengths (within a given mode), no instructions require microcoding, etc. All classic hallmarks of RISC. Like every RISC architecture it has its own quirks (e.g. conditional instructions), but it is fundamentally similar to MIPS, PowerPC, and SPARC, which are all CPUs I have designed. And it is *very* different than x86 and x86-64 (which are also CPUs I have designed).

Ask anyone who actually designs CPUs, and they will tell you that x86 is clearly CISC, pretty much everything else now is RISC, and the differences are easily visible in the complexity of the designs.

There is of course no doubt that ARM is RISC... my argument is that "RISC" and "CISC" in their original sense are notions from early days of CPUs and represent extreme poles on the design spectrum that are simply not useful or even relevant for today's high-performance computing. Current designs are always hybrids.

You simply can't make a fast CISC CPU without backing it by some sort of reduced architecture (like x86 CPUs do with microcode). And you also can't make a fast RISC CPU without giving it complex operations, like ARMv8 ability to store/load multiple registers via a single instruction or the auto-increment addressing modes. And while most modern RISC CPUs might not use microcode in the classical sense, they absolutely do split operations into micro-ops.

The point being: labels like RISC and CISC trivialize the discussion. We need to look at the actual relevant differences between the ISA (variable width vs. fixed width, load/store vs register–memory, addressing modes, code density) and between the hardware implementation (wide vs. narrow backend, cache architecture, OOE capabilities, branch prediction etc.). Intel is not faster than ARM Cortex because one is CISC and another one is RISC, and Apple is not faster than Intel because one is RISC and another one is CISC. There are simply different design optimization points and there are worse and better designs for each of the optimization point. That is something we should talk about, not trying to pigeonhole CPUs into one of the two loosely defined and frankly unhelpful buckets.

Macbookprodude said:
I guess I am a traditionalist from the old school of cpu architectures. I was in college when the PPC G3/MAINLY G4 era came out.

Sorry, but what you are saying makes zero sense. CPUs are computing devices, not fashion items. If you are "traditionalist" (whatever that means), you shouldn't be using any computer made in the last 20 years.

Ploki · Mar 29, 2021

Macbookprodude said:
Your interpretation and opinion.. but I think they are dead on.. then again, I don’t support Apple after Jobs died.. M1 may have greatness, but according to those reports I think they are doing the same s*** during the MHz myth days. Only time will tell and so far Apple is playing the alienation game by not allowing other OS’s to run on their pathetic closed hardware.

no really, i don't care what reports say because i have one - tested it hands on with intel based macs and also versus desktop PC's.
It's insanely good and that intel propaganda is just pathetic

PBG4 Dude · Mar 29, 2021

Unregistered 4U said:
Of course, some would say Apple of the 1980’s was better.

For some reason, this discussion hits me like the debate whether David Lee Roth or Sammy Hagar were the better singer for Van Halen. Roth put Van Halen on top of the world(! Oh yeah!). Hagar had plenty of hits with Van Halen, but he wasn’t as amazing a frontman as Diamond Dave, who was the face of the Van Halen brand.

The same could be said for Apple. Tim Cook could announce the Apple Pill. Take it, and internet connected AR lenses grow inside your eye. People would say yeah that’s cool, but imagine what Steve would have done. I don’t know, Monday morning brain fart; don’t mind me.

cmaier · Mar 29, 2021

leman said:
There is of course no doubt that ARM is RISC... my argument is that "RISC" and "CISC" in their original sense are notions from early days of CPUs and represent extreme poles on the design spectrum that are simply not useful or even relevant for today's high-performance computing. Current designs are always hybrids.

You simply can't make a fast CISC CPU without backing it by some sort of reduced architecture (like x86 CPUs do with microcode). And you also can't make a fast RISC CPU without giving it complex operations, like ARMv8 ability to store/load multiple registers via a single instruction or the auto-increment addressing modes. And while most modern RISC CPUs might not use microcode in the classical sense, they absolutely do split operations into micro-ops.

The point being: labels like RISC and CISC trivialize the discussion. We need to look at the actual relevant differences between the ISA (variable width vs. fixed width, load/store vs register–memory, addressing modes, code density) and between the hardware implementation (wide vs. narrow backend, cache architecture, OOE capabilities, branch prediction etc.). Intel is not faster than ARM Cortex because one is CISC and another one is RISC, and Apple is not faster than Intel because one is RISC and another one is CISC. There are simply different design optimization points and there are worse and better designs for each of the optimization point. That is something we should talk about, not trying to pigeonhole CPUs into one of the two loosely defined and frankly unhelpful buckets.

Sorry, but what you are saying makes zero sense. CPUs are computing devices, not fashion items. If you are "traditionalist" (whatever that means), you shouldn't be using any computer made in the last 20 years.

But you seem to be basing your argument on the idea that something has changed. Unlike in the “early days,” now x86 cpus have a “reduced architecture” because of microcode.

But CISC machines have always had microcode (though they sometimes didn’t do it with microcode ROMs). That’s what makes them CISC machines. If you see microcode, you’ve got CISC.

And the labels are incredibly useful. In the industry, it’s shorthand for all the things I keep mentioning. Universally, RISC means the same thing to us. Same with CISC.

And when you compare two designs in the modern era where there is always enough instruction memory, ceteris paribus, the RISC one wins. Yes, what you call hardware implementation (and the rest of us call microarchitecture, because hardware implementation is a different thing), is important. And a better micro architecture can result in overcoming the RISC advantage, just as a better implementation, or a better semiconductor process can. But that doesn’t mean you can ignore RISC/CISC. Because if two products are made on the same process, using the same cell library, and the same macro circuits, and the same physical design techniques, using the same microarchitecture, then the RISC design wins in performance and performance per watt. That’s what makes these ”buckets” helpful.

ANd it’s disingenuous to say they are loosely defined. They are not. Everyone agrees that if you see microcode [more specifically the need for a state machine in the instruction decode unit], addressing modes where random instructions access memory, variable instruction lengths outside of modes [more specifically, the requirement to scan the instruction stream to determine instruction end points], it’s CISC.

When you mention things like autoincrement or multi-register load/store as “complex,” that’s never been what “complex” means in RISC vs. CISC. Complex has always referred to the decoding/issuing. It’s trivial to implement an increment - it does not complicate the pipelines or require interactions between multiple “micro-ops.” The instruction decoder simply sends a single flag signal to the ALU, and the ALU does it as the last step. Multiple load/stores in parallel - same deal. Lots of possible implementations, but it doesn’t even need to take more than once cycle (other than the memory accesses - the RF can be multi-ported).

What makes an instruction complex is ”oh, before i can perform this first I have to wait 50 cycles to read this memory location, which could cause a cache miss. Since I have to add the result of register AX to that, I need someplace to hold the results temporarily. Hopefully that didn’t cause an overflow. If it didn’t, then I use the result of the sum as an address to store the results of this subtract involving two other registers.” I have to use multiple parts of the chip (load/store unit, ALU, etc.) in sequence, with dependencies between steps so that I can’t just let things fly and work on something else while waiting for the results. Each of the things you mention as “complex,’ by contrast, are just things that happen within a single unit, and are more parallel or take longer than, say, a shift left instruction. Every real RISC machine ever has likely had an integer multiplier that takes 4-10 times as long to reach a result as an integer addition. That doesn’t make the integer multiply instruction “complex.”. Same with square root, divide, etc. Same with autoincrement - the results of add-plus-increment do not require that I add, store the results somewhere, then send a new increment instruction into the ALU. Most likely the ALU feedback register is simply an adder with a bypass, and you’re done.

leman · Mar 29, 2021

cmaier said:
But you seem to be basing your argument on the idea that something has changed. Unlike in the “early days,” now x86 cpus have a “reduced architecture” because of microcode.

But CISC machines have always had microcode (though they sometimes didn’t do it with microcode ROMs). That’s what makes them CISC machines. If you see microcode, you’ve got CISC.

And the labels are incredibly useful. In the industry, it’s shorthand for all the things I keep mentioning. Universally, RISC means the same thing to us. Same with CISC.

And when you compare two designs in the modern era where there is always enough instruction memory, ceteris paribus, the RISC one wins. Yes, what you call hardware implementation (and the rest of us call microarchitecture, because hardware implementation is a different thing), is important. And a better micro architecture can result in overcoming the RISC advantage, just as a better implementation, or a better semiconductor process can. But that doesn’t mean you can ignore RISC/CISC. Because if two products are made on the same process, using the same cell library, and the same macro circuits, and the same physical design techniques, using the same microarchitecture, then the RISC design wins in performance and performance per watt. That’s what makes these ”buckets” helpful.

ANd it’s disingenuous to say they are loosely defined. They are not. Everyone agrees that if you see microcode [more specifically the need for a state machine in the instruction decode unit], addressing modes where random instructions access memory, variable instruction lengths outside of modes [more specifically, the requirement to scan the instruction stream to determine instruction end points], it’s CISC.

When you mention things like autoincrement or multi-register load/store as “complex,” that’s never been what “complex” means in RISC vs. CISC. Complex has always referred to the decoding/issuing. It’s trivial to implement an increment - it does not complicate the pipelines or require interactions between multiple “micro-ops.” The instruction decoder simply sends a single flag signal to the ALU, and the ALU does it as the last step. Multiple load/stores in parallel - same deal. Lots of possible implementations, but it doesn’t even need to take more than once cycle (other than the memory accesses - the RF can be multi-ported).

What makes an instruction complex is ”oh, before i can perform this first I have to wait 50 cycles to read this memory location, which could cause a cache miss. Since I have to add the result of register AX to that, I need someplace to hold the results temporarily. Hopefully that didn’t cause an overflow. If it didn’t, then I use the result of the sum as an address to store the results of this subtract involving two other registers.” I have to use multiple parts of the chip (load/store unit, ALU, etc.) in sequence, with dependencies between steps so that I can’t just let things fly and work on something else while waiting for the results. Each of the things you mention as “complex,’ by contrast, are just things that happen within a single unit, and are more parallel or take longer than, say, a shift left instruction. Every real RISC machine ever has likely had an integer multiplier that takes 4-10 times as long to reach a result as an integer addition. That doesn’t make the integer multiply instruction “complex.”. Same with square root, divide, etc. Same with autoincrement - the results of add-plus-increment do not require that I add, store the results somewhere, then send a new increment instruction into the ALU. Most likely the ALU feedback register is simply an adder with a bypass, and you’re done.

Very enlightening, thank you for the elaborate reply. I always thought that RISC vs. CISC boiled down to “direct implementation” vs. “microcode-controlled implementation ”, and that’s why I believed that focusing too much discussion on this aspect is overly simplistic in a context of an OOE, speculative CPU, where many more things are going on simultaneously. I admit that I might have underestimated their importance however. It’s always great to get some professional insight on the matter.

Andropov · Mar 29, 2021

I also thought that the RISC/CISC boundary was less clear now as it were a couple decades ago. I remember reading this arstechnica article on this matter some time ago and their arguments seemed to be very convincing.

thekev · Mar 29, 2021

cmaier said:
Why? I implemented floating point square root for the follow-up to the PowerPC x704, and we certainly didn’t have any microcode. The load/store unit sends the instruction to the ALU where it goes to the sqrt unit (which is a lookup table and some newton-Raphson magic, if I remember correctly - that was in 1996 or 1997 so my memory is vague), and that ALU takes however many cycles as necessary before setting the “I’m done” signal which tells the retirement circuitry that the contents of the bypass register are valid.

There’s a weird idea going around (i think the stanford link above suggested it too) that “complicated” instructions require microcode. Multiplication was given as an example. Multiplication and division can take many clock cycles, but there’s no microcode used on any RISC processor I’ve ever seen. There’s an ALU, you tell it you want to do div or mul, and it takes multiple cycles, and signals when it is done. If the instruction requires passing data from the output of one part of the ALU into an input of another part of the ALU, that’s handled by sequencing logic within the ALU. (That’s rare. I seem to recall I did that in an integer divider, once, where I had to feed something from one circuit into the input of the integer multiplier).

The square root example came to mind in particular due to the potential number of micro-ops potentially involved there. I wasn't thinking so much of complicated instructions from a conceptual viewpoint as those that commonly take many micro ops.

Floating point multiplication doesn't take many cycles on any ARM processor constructed with performance in mind. The optimization guides in many of these latency tables suggest around 4 cycles is common for both add and subtract variants of FMA3. A long time ago, yeah multiplies would have still been expensive, and minimizing their use could still be advantageous in some rare scenarios (eg, you're programming for an exotic target like an FPGA). I definitely wouldn't have commented on multiplies as "complex" though, which is why I mention them here.

leman · Mar 29, 2021

thekev said:
The square root example came to mind in particular due to the potential number of micro-ops potentially involved there. I wasn't thinking so much of complicated instructions from a conceptual viewpoint as those that commonly take many micro ops.

Floating point multiplication doesn't take many cycles on any ARM processor constructed with performance in mind. The optimization guides in many of these latency tables suggest around 4 cycles is common for both add and subtract variants of FMA3.

Just since you are mentioning this, square root seems to be a single uop on M1 and FMA has 4 cycles of latency (with up to four FMA operations executed per cycle, that's up to 16 FP32 FMA per cycle)

Source: https://dougallj.github.io/applecpu/firestorm-simd.html

ARM based MacBook Pro discussion [merged]

macrumors G4

macrumors Core

Suspended

Cancelled

macrumors 604

macrumors 604

Cancelled

macrumors 68000

macrumors 68000

Suspended

Suspended

macrumors 68000

Suspended

Suspended

Suspended

macrumors 604

Suspended

macrumors Core

macrumors 601

macrumors 601

Suspended

macrumors Core

macrumors 6502a

macrumors 604

macrumors Core

Our Staff