LFM2.5-230M: An Ultra-Compact Model Runs on a Raspberry Pi and Almost Any Modern Phone

Liquid AI released LFM2.5-230M — one of the smallest language models out there today, at just 230 million parameters. It’s compact enough to run on a small device without trouble: it needs 293 MB of memory on a Raspberry Pi 5 and 375 MB on a smartphone. On the Raspberry Pi 5 the model outputs 42 tokens per second during decoding, and on the flagship Galaxy S25 Ultra it speeds up to 213 tokens per second. A person reads at roughly 5–7 tokens per second, so even on a cheap single-board computer the model “types” its answer faster than you can read it, and on a phone the text appears almost instantly. On benchmarks the model competes with models twice its size and often beats them. The project is fully open: the base version (LFM2.5-230M-Base) and the post-trained one (LFM2.5-230M) can both be downloaded from Hugging Face together with the weights, the source and SDK live on Github, and the full model library is described in the documentation. The weights are open, meaning the models can be downloaded, fine-tuned, and deployed without restrictions.

Where it actually runs

The model’s main constraint is simple: it needs about 293–375 MB of RAM and a reasonably modern processor.

Four scenarios are confirmed by the authors’ own tests: cloud GPUs (H100), the flagship Galaxy S25 Ultra smartphone, the Raspberry Pi 5 single-board computer, and the Unitree G1 humanoid robot via its onboard NVIDIA Jetson Orin chip. Support is claimed for Apple, AMD, Qualcomm, and Nvidia chips, so the range is wider. Since the model runs on a Snapdragon Gen4 and on a Raspberry Pi, almost any modern phone or tablet with 4+ GB of memory will handle it, and laptops, desktops, and servers have resources to spare. Here the bottleneck won’t be memory but processor speed.

Smartwatches sit in a gray zone: top models with 2 GB of memory and a Snapdragon chip could theoretically run it, but the blog has no such tests, and in practice the obstacle isn’t memory but heat and battery. On microcontrollers and simple consumer electronics (kettles, light bulbs, fitness trackers, earbuds) it won’t run at all: memory there is measured in kilobytes, not hundreds of megabytes, and a thousandfold gap can’t be closed by any optimization.

LFM2.5-230M is built on the LFM2 architecture, and thanks to this it runs noticeably faster than models of the same size, including hybrids based on SSM (state space models) and Gated Delta Networks. The model is especially good at two things: tool use and data extraction.

That said, the authors are upfront: for tasks that require a lot of reasoning — advanced math, code generation, creative writing — this model isn’t the right pick. It’s built for something else.

How LFM2.5-230M was trained

The model was pre-trained on 19 trillion tokens. Its context window is 32K.

After that comes a light post-training stage in three steps. First, supervised fine-tuning with knowledge distillation from the larger LFM2.5-350M model, meaning the small model learned to imitate the big one. Then direct preference optimization (DPO) — a way to tune the model toward the answers people prefer. And finally multi-domain reinforcement learning, that is, training across several types of tasks at once. This recipe is meant to make the model work well out of the box while staying easy to fine-tune for your own specific task.

The model was put on the Unitree G1 humanoid robot, where it ran entirely on the device itself, on the built-in NVIDIA Jetson Orin chip. The model acted as a skill-selection layer: it takes a single plain-language command such as “hold still for 2 seconds, then walk forward at 1 m/s for 3 meters” and breaks it down into a sequence of calls to ready-made low-level skills. So a 230-million-parameter model can serve as a natural-language control interface for a robot.

What you can embed the model into

For perspective, real robots already run on that same Raspberry Pi 5. There’s the two-legged TonyPi humanoid from Hiwonder, which walks, bends over, and picks up objects, robot dogs like PiDog, and wheeled platforms with lidar and autonomous navigation (LanderPi, MentorPi, ROSMASTER X3). The Pi 5 can handle all of this because its processor is roughly three times faster than the Pi 4’s, plus it has a slot for an AI accelerator.

A language model in such a robot plays the role of a “brain” that parses plain-language commands and breaks them down into concrete actions. That’s exactly the task LFM2.5-230M is built for. Liquid AI demonstrated this on the Unitree G1 robot, except that one runs on the more powerful NVIDIA Jetson Orin chip rather than a Pi. But since the model fits into 293 MB on a Raspberry Pi 5, nothing stops you from putting it on a simpler homemade robot.

The benchmark numbers

The model was run through ten benchmarks. The headline result: despite its size, it competes with models twice as large and often beats them. The tests cover different skills: knowledge (GPQA Diamond, MMLU-Pro), instruction following (IFEval, IFBench, Multi-IF), data extraction (CaseReportBench), and tool use (BFCLv3, BFCLv4, τ²-Bench Telecom and Retail).

On IFEval the model scored 71.71, beating both Gemma 3 1B IT (63.49) and Qwen3.5-0.8B (59.94), even though both are larger. On IFBench it scored 38.40 against 22.87 for Qwen and 20.33 for Gemma. On CaseReportBench (data extraction) it hit 22.51, higher than every competitor except the larger model in the same family. Its weak spot shows on τ²-Bench Telecom: just 5.26, where the model struggles. On MMLU-Pro it also trails Qwen3.5-0.8B (20.25 against 37.42) — its compact size shows.

The authors’ takeaway: LFM2.5-230M is a great fit for large-scale data-extraction pipelines or lightweight agentic workloads right on the device.

How fast it is

The model ships with day-one support across the main inference ecosystem: llama.cpp (GGUF checkpoints for edge), MLX (for Apple chips), vLLM and SGLang (for GPU servers), and ONNX (for various accelerators).

The CPU numbers are these (the Fast Inference Everywhere section): on the Raspberry Pi 5 the model delivers 523 tokens per second on prefill and 42 on decode, using just 293 MB of memory, less than any competitor. On the Snapdragon Gen4 (Galaxy S25 Ultra) it’s 1,158 tokens on prefill and 213 on decode at 375 MB of memory. Prefill is processing the input text, while decode is generating the response step by step, token by token.

On an H100 GPU the model has the lowest latency at every concurrency level: from about 50 ms at one request to roughly 205 ms at 64 simultaneous requests. For comparison, Qwen3.5-0.8B reaches about 530 ms at the same 64 requests.

Who it competes with on compactness

The class of models below 500M parameters is narrow but not empty:

SmolLM2-135M and SmolLM2-360M from Hugging Face (135M and 360M parameters, trained on 2T and 4T tokens);
Qwen2.5-0.5B from Alibaba (around 500M, supports up to 128K context and 29 languages);
Qwen2-0.5B (also roughly 300M excluding embeddings);
Granite 4.0-350M from IBM (350M, in standard and hybrid H versions).

On benchmarks LFM2.5-230M holds up well. It beats Granite 4.0-H-350M on most of the blog’s tests while being smaller: on IFEval it’s 71.71 against 61.27. For comparison, SmolLM2-Instruct scores about 56.7 on the same IFEval, although SmolLM2 is strong at instruction following for its size and even beats Qwen2.5-1.5B on that test. In its weight class, LFM2.5-230M looks like a strong player.

All the other “small” models people tend to mention are noticeably larger: Gemma 3 1B, Qwen3.5-0.8B, and Llama 3.2 1B run from 800M to a billion parameters, that is, 3.5–4.3 times bigger than LFM2.5-230M. So there aren’t many direct rivals in exactly its weight class, and most of them are very compact families like SmolLM2 and IBM’s Granite line.

The bottom line

LFM2.5-230M is Liquid AI’s bet on edge AI, that is, on AI that runs right on the device rather than in the cloud. The model is open, fast, light on memory, and good at tool use and data extraction. It’s part of a whole LFM2.5 family that includes base models for customization and specialized audio and vision variants on one architecture. It isn’t meant for reasoning or code, but as a lightweight engine for agentic tasks on a phone or single-board computer it looks convincing.