Apple Develops Breakthrough Method for Running LLMs on iPhones

dgdosen · Apr 30, 2024

Frantisekj said:
From my short experience AI is now mainly trained by curators to give "curated" answers and keep us in mental prison we are in now. Twice I managed after long fiddling to get unbiased answer but then was logged of immediately. It has never happened in other any occasion.

So... Healthy skepticism?

Zest28 · Apr 30, 2024

Siri 2.0 / Apple GPT is going to be a lot of fun I tell you.

Zest28 · Apr 30, 2024

Frantisekj said:
From my short experience AI is now mainly trained by curators to give "curated" answers and keep us in mental prison we are in now. Twice I managed after long fiddling to get unbiased answer but then was logged of immediately. It has never happened in other any occasion.

Let’s think about this for a minute. Do you really think ChatGPT would be useful it was unfiltered as it is trained for a large part on Reddit data?

An AI that is a Redditor is not going to be very good, so it needs some “supervision”.

purplerainpurplerain · Apr 30, 2024

seek3r said:
Not that I think we really need massive LLMs on our phones, but that's what beefing up the neural processor, reportedly a major focus of the M4 should be good for. Running such things on a part of the silicon engineered for it is a lot more power efficient than on the general part of the CPU

Problem there is neural engine isn’t as powerful for LLMs as the GPU is. That’s why all the local LLMs use the GPU. The GPU has the memory bandwidth and grunt.

For the Apple Silicon’s neural engine to match the GPU at this task the size of the NE will have to be much bigger.

It is suited for small models on phones though.

Zest28 · Apr 30, 2024

purplerainpurplerain said:
Problem there is neural engine isn’t as powerful for LLMs as the GPU is. That’s why all the local LLMs use the GPU. The GPU has the memory bandwidth and grunt.

For the Apple Silicon’s neural engine to match the GPU at this task the size of the NE will have to be much bigger.

It is suited for small models on phones though.

I’d say Apple is creating a solution for a problem that doesn’t exist.

An iPhone has an active internet connection so their LLM model can run on powerful servers like everybody else.

purplerainpurplerain · Apr 30, 2024

Zest28 said:
I’d say Apple is creating a solution for a problem that doesn’t exist.

An iPhone has an active internet connection so their LLM model can run on powerful servers like everybody else.

That will be one option. There will be a local option called something like Siri Plus.

Then the cloud options will be Siri Max, ChatGPT, etc. those options will be available on a dropdown menu just like choosing a browser search engine.

Still stupid to use as a search engine replacement in the current state. Whatever results an LLM gives you has to be double checked if the subject is outside your field of knowledge. Incorrect answers and big carbon footprint is a kind of disaster capitalism.

seek3r · Apr 30, 2024

Zest28 said:
I’d say Apple is creating a solution for a problem that doesn’t exist.

An iPhone has an active internet connection so their LLM model can run on powerful servers like everybody else.

Turn off internet and see how much more responsive siri is for things that can be handled locally. That’s why

frownface · Apr 30, 2024

fizzyfizz said:
Sounds great! But please call it something other than Siri as that name fills me with dread every time I consider giving it another try.

good bye siri, say hellow to Iris; name because it's always watching.

name99 · May 1, 2024

nwiggin said:
Assuming 8-bit quantization, for 200b parameters you would need over 100gb of ram. Will the next iPhone have 128gb ram??

Only if the parameters are non-zero...
As I have said multiple times, the intrinsic degree of sparsity [ie how sparse you can force them if you try] of these models is not (as far as I know) publicly known. If you are running on nVidia, there's no reason to push for anything beyond nVidia's "structured sparsity" of .5, but it may well be possible to shift sparsity down to 10% or less.

Secondly you're assuming the entire model needs to be in RAM. If you structure the model appropriately (eg order tokens by frequency of reference) you may find that most of the model runs "cold", is rarely referenced, and can be demand paged out of flash. We've seen at least one paper from Apple that discusses elements of this idea.

Third the most interesting large models are "mixture of experts" where you engage in some preliminary layers of the model, then specialize to a specific sub-model. The entire collection consists of 200B (or whatever) parameters, but any one of the ten sub-models may be a tenth of that size.

name99 · May 1, 2024

purplerainpurplerain said:
You can fill the RAM but your processor and battery is still being inundated by an LLM. An LLM has no place on a phone. It’s bad enough in a MacBook Pro.

As an example I have ran all the open source LLMs on M3 Max. They chew through the battery with the worst offender being Command-R which can deplete the whole battery in less than an hour. Even a Blender render is less stressful.

On a phone you need specialized language models that don’t need to ‘know it all’. A small size model can do many things without needing to ‘know it all’.

And you don’t need an LLM to switch to dark mode or set an alarm. That’s just voice commands which have been around since OS 9.1 classic.

And those models run where? On the CPU? On the GPU?
Every OSS model I have seen was developed for nVidia and runs by default on the CPU. Maybe (possibly, but don't hold your breath) someone optimized it for the Apple GPU, but they almost certainly did not even attempt to ensure that it actually runs on the ANE. (Partially bcs that requires knowing what the ANE is capable, and very few people outside Apple know that; partially because that requires real skill at knowing what each layer of the model does, and how it can be modified or not; which very few people outside the big tech companies know.)

Point is, what your "nVidia-optimized, now running on a CPU, never seriously ported to Apple" model can do is very different from what might be possible by an Apple team that know their target hardware, know which layers in a model can be restructured to optimize for the hardware, and can retrain the newly structured model to compensate for the layer changes.

name99 · May 1, 2024

Zest28 said:
Let’s think about this for a minute. Do you really think ChatGPT would be useful it was unfiltered as it is trained for a large part on Reddit data?

An AI that is a Redditor is not going to be very good, so it needs some “supervision”.

You (like almost everyone in this thread) is conflating a Large Language Model with a Chatbot.
A large language model is something that can usefully "understand" natural english.
A Chatbot is something that can answer natural english questions.

You can use a large language model for purposes (very useful purposes) that don't include answering questions. Things like translation, better summarizing articles, better grammar checkers, detecting things in mail and texts (smarter versions of today's Mail's "you referred to an enclosure but did not include one" or "follow up on this email in three days). Such an LLM could, for example, do a better job than today's Siri of mapping a particular natural language utterance (including in the face of noise) to one of the various commands that Siri knows how to perform.

A Chatbot is a way to answer questions. It makes sense when the question and answer are simple; it's stupid beyond belief when the answer is not simple ("Hey Siri, explain to me how Quantum Electrodynamics Works"). To improve Siri as a Chatbot, Apple will have to steer a course between the way intelligent people expect to use such functionality and the idiot masses who think it's a magic box that can answer any question of any form whatsoever. I'm primarily interested in the small subset of sane and intelligent users. For example, Reddit is, in fact, frequently very useful. But it's useful in terms of scanning through multiple answers, latching onto the few that seem relevant and not unhinged, and extracting value from them. Not functionality I expect from a Chatbot...

Finally, how does a Chatbot PROVIDE its answers? One option is to essentially compile the entire internet (or at least the training corpus) into eleventy trazillion parameters. But this is not the only way, and not necessarily optimal. Another option is a much smaller model which understand english, enough to know that "this type of query gets handled by this sort of code" and "that type of query by that type of code". This is essentially what Siri is today (some queries go to Wolfram.com, some go to Wikipedia, some go to some database of Apple hardware info, etc etc); what an LLM brings is a conversion of the spoken string into a final "normalized utterance" representing what type of request we want handled.

There is, in other words, no contradiction between
- "Apple is creating its own LLM for on-device use" (siri front-end) and
- "Apple is creating its own massive LLM" (one or more of the siri back-ends to certain questions, eg "How do I pair my new Apple Watch to replace my old Apple Watch?")
- "Apple is negotiating with Google to use the Gemini LLM" (act as one or more of the different siri back-ends, now to answer factual questions like "When was the Luxor Hotel in Las Vegas built?")

name99 · May 1, 2024

purplerainpurplerain said:
Problem there is neural engine isn’t as powerful for LLMs as the GPU is. That’s why all the local LLMs use the GPU. The GPU has the memory bandwidth and grunt.

For the Apple Silicon’s neural engine to match the GPU at this task the size of the NE will have to be much bigger.

It is suited for small models on phones though.

Misleading.
In terms of raw FMA throughput, ANE (current version) consists of 16 cores, and each core can handle 256 FMAs per cycle. A GPU core can handle 128 FMAs per cycle, so that's literally 32 cores worth of GPU – which is less than the highest end of the M range, but still pretty respectable if you have just a phone or an M class (not a Max or Ultra).

Raw FMA is not the full story, of course. ANE hardware is much more structured, meaning lower energy to achieve the same task. But this same structure means that when a neural model is constructed that is not matched to that structure, then the model will run badly to not at all on the hardware, and it needs to run somewhere else.
It's like if you constructed a bunch of code running using FP64. Maybe you don't need the precision of FP64, but by using that in your code, you've forced it to run on the CPU because the GPU just doesn't support FP64. The problem is less with the GPU than with a developer who doesn't understand what is or is not essential in the code they are trying to port to a device...

Point is – what some random LLM from hugging face does on Apple Silicon is far from what's really possible on Apple Silicon. Slowly we are seeing elements of this fixed, but it's a slow process (compare to slowly getting people to use Metal and Apple Silicon GPUs correctly, and that's been on-going for a lot longer).

name99 · May 1, 2024

seek3r said:
Turn off internet and see how much more responsive siri is for things that can be handled locally. That’s why

Another version of this: download two language models in Apple Translate (eg Chinese and English) then see how well translation works locally (including eg translation of images). It's pretty damn impressive.

ric22 · May 2, 2024

MacRumors said:
Apple GPT in your pocket? It could be a reality sooner than you think. Apple AI researchers say they have made a key breakthrough in deploying large language models (LLMs) on iPhones and other Apple devices with limited memory by inventing an innovative flash memory utilization technique.

LLMs and Memory Constraints

LLM-based chatbots like ChatGPT and Claude are incredibly data and memory-intensive, typically requiring vast amounts of memory to function, which is a challenge for devices like iPhones that have limited memory capacity. To tackle this issue, Apple researchers have developed a novel technique that uses flash memory – the same memory where your apps and photos live – to store the AI model's data.

Storing AI on Flash Memory

In a new research paper titled "LLM in a flash: Efficient Large Language Model Inference with Limited Memory," the authors note that flash storage is more abundant in mobile devices than the RAM traditionally used for running LLMs. Their method cleverly bypasses the limitation using two key techniques that minimize data transfer and maximize flash memory throughput:

Windowing: Think of this as a recycling method. Instead of loading new data every time, the AI model reuses some of the data it already processed. This reduces the need for constant memory fetching, making the process faster and smoother.

Row-Column Bundling: This technique is like reading a book in larger chunks instead of one word at a time. By grouping data more efficiently, it can be read faster from the flash memory, speeding up the AI's ability to understand and generate language.

The combination of these methods allows AI models to run up to twice the size of the iPhone's available memory, according to the paper. This translates to a 4-5 times increase in speed on standard processors (CPUs) and an impressive 20-25 times faster on graphics processors (GPUs). "This breakthrough is particularly crucial for deploying advanced LLMs in resource-limited environments, thereby expanding their applicability and accessibility," write the authors.

Faster AI on iPhone

The breakthrough in AI efficiency opens new possibilities for future iPhones, such as more advanced Siri capabilities, real-time language translation, and sophisticated AI-driven features in photography and augmented reality. The technology also sets the stage for iPhones to run complex AI assistants and chatbots on-device, something Apple is already said to be working on.

Apple's work on generative AI could eventually be incorporated into its ‌Siri‌ voice assistant. Apple in February 2023 held an AI summit and briefed employees on its large language model work. According to Bloomberg, Apple is aiming for a smarter version of Siri that's deeply integrated with AI. Apple is planning to update the way that ‌Siri‌ interacts with the Messages app, allowing users to field complex questions and auto-complete sentences more effectively. Beyond that, Apple is rumored to be planning to add AI to as many Apple apps as possible.

Apple GPT

Apple is reportedly developing its own generative AI model called "Ajax". Designed to rival the likes of OpenAI's GPT-3 and GPT-4, Ajax operates on 200 billion parameters, suggesting a high level of complexity and capability in language understanding and generation. Internally known as "Apple GPT," Ajax aims to unify machine learning development across Apple, suggesting a broader strategy to integrate AI more deeply into Apple's ecosystem.

As of the latest reports, Ajax is considered more capable than the earlier generation ChatGPT 3.5. However, it's also suggested that OpenAI's newer models may have advanced beyond Ajax's capabilities as of September 2023.

Both The Information and analyst Jeff Pu claim that Apple will have some kind of generative AI feature available on the ‌iPhone‌ and iPad around late 2024, which is when iOS 18 will be coming out. Pu said in October that Apple is building a few hundred AI servers in 2023, with more to come in 2024. Apple will reportedly offer a combination of cloud-based AI and AI with on-device processing.

Article Link: Apple Develops Breakthrough Method for Running LLMs on iPhones

Using flash memory? Sounds like it wouldn't be quick and would likely thrash the poor tiny SSDs Apple give us. No thanks.

name99 · May 2, 2024

ric22 said:
Using flash memory? Sounds like it wouldn't be quick and would likely thrash the poor tiny SSDs Apple give us. No thanks.

Isn't it great that the people designing future Apple products are actual, you know, engineers, as opposed to rando's on the internet?

Frantisekj · May 6, 2024

Zest28 said:
Let’s think about this for a minute. Do you really think ChatGPT would be useful it was unfiltered as it is trained for a large part on Reddit data?

An AI that is a Redditor is not going to be very good, so it needs some “supervision”.

Yaeh. We all live under supervision that we do not feel it already ... So no big change for the most.

Frantisekj · May 6, 2024

dgdosen said:
So... Healthy skepticism?

I see it clearly and it is not just my experience.

Search

Search

Apple Develops Breakthrough Method for Running LLMs on iPhones

dgdosen

macrumors 68030

Zest28

macrumors 68020

Zest28

macrumors 68020

purplerainpurplerain

macrumors 6502a

Zest28

macrumors 68020

purplerainpurplerain

macrumors 6502a

seek3r

macrumors 68020

frownface

macrumors 6502

name99

macrumors 68020

name99

macrumors 68020

name99

macrumors 68020

name99

macrumors 68020

name99

macrumors 68020

ric22

macrumors 68020

name99

macrumors 68020

Frantisekj

macrumors 6502a

Frantisekj

macrumors 6502a

Our Staff