Your Mac Just Got a Brain: Run an AI Agent Locally That Controls Your Computer with a 4B Model

I was tinkering with yet another cloud API for my weekend project when it hit me—every screenshot, every interaction, every command I fired off was flying through someone else’s server. That’s when I stumbled upon something that feels almost too good to be true: a fully local GUI agent that runs entirely on your Mac, no internet required.

A few months back, the team behind Skill showed off how an open-source AI could control macOS by visually understanding any desktop interface—like a human looking at a screen, clicking buttons, typing text. Now they’ve pushed it further. They just released the companion on‑device model, cleverly named Mano‑P, alongside a custom inference framework called Cider. Both are open source. Together, they turn the whole “edge AI” dream from “yeah, it technically works” into “wait, it’s actually fast and usable.”

“Mano-P is a GUI‑VLA model—pure visual understanding of graphical interfaces. No CDP protocols, no HTML parsing. It just looks at a screenshot, finds what you want, and clicks or types.”

If you’ve ever tried to build an automation script for a desktop app, you know the pain: you hard‑code coordinates, rely on accessibility APIs that break with every update, or send everything to an external API that charges per image. Mano‑P skips all that. It’s trained on 60,000 GUI trajectories covering over 3 million actions across mainstream desktop and web workflows. That’s not cheap data, but they’ve released it openly.
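To make that concrete, here's roughly what a screenshot‑in, action‑out loop looks like. To be clear, this is a sketch of the general pattern, not the actual Mano‑P API: run_model() and the action schema are placeholders I made up, and the clicks go through cliclick (brew install cliclick) as a stand‑in for whatever Mano‑P uses internally.

    import subprocess, time

    def screenshot(path="/tmp/screen.png"):
        # macOS built-in capture; -x silences the shutter sound
        subprocess.run(["screencapture", "-x", path], check=True)
        return path

    def run_model(image_path, task):
        # Placeholder for the local VLA model: screenshot + instruction in,
        # one grounded action out, e.g. {"type": "click", "x": 412, "y": 88}
        return {"type": "done"}  # stub so the sketch runs end to end

    def execute(action):
        if action["type"] == "click":
            subprocess.run(["cliclick", f"c:{action['x']},{action['y']}"], check=True)
        elif action["type"] == "type":
            subprocess.run(["cliclick", f"t:{action['text']}"], check=True)

    def run_task(task, max_steps=20):
        for _ in range(max_steps):
            action = run_model(screenshot(), task)
            if action["type"] == "done":
                return
            execute(action)
            time.sleep(0.5)  # let the UI settle before the next screenshot

The whole loop is just pixels and coordinates, which is exactly why it survives the app updates that break accessibility‑tree scripts.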

Here’s the part that got me excited—performance numbers that actually make sense for a real Mac. On an M4 Pro, the 4‑bit quantized model achieves 476 tokens/s prefill and 76 tokens/s decode, with peak memory usage of just 4.3GB. That means a cheap 16GB MacBook Air could run this thing without breaking a sweat. In the CUA benchmark (computer use agent tasks), this tiny 4B model matches the accuracy of much larger cloud models—while keeping every screenshot and every task on your laptop.

And it gets better. The framework Cider is the secret sauce that makes Mano‑P fly. It extends Apple’s MLX ecosystem by adding something fundamental that was missing: activation quantization. MLX natively supports weight quantization (W4A16, W8A16), but it couldn’t quantize the intermediate activations. Cider taps into Apple’s Metal 4 API to enable hardware‑accelerated INT8 TensorOps on Apple GPUs for the first time.
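If "activation quantization" sounds abstract, the core math fits in a dozen lines. Here's a generic symmetric INT8 W8A8 scheme in NumPy; it illustrates the technique, not Cider's actual kernels, which do the equivalent on the GPU through Metal 4 TensorOps:

    import numpy as np

    def quantize_int8(t):
        # Symmetric per-tensor quantization: map floats onto [-127, 127]
        scale = np.abs(t).max() / 127.0
        q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
        return q, scale

    x = np.random.randn(4, 64).astype(np.float32)   # activations
    w = np.random.randn(64, 32).astype(np.float32)  # weights

    # W8A16 (what MLX already does): only the weights are int8
    wq, w_scale = quantize_int8(w)

    # W8A8 (what Cider adds): the activations are int8 too, so the matmul
    # itself can run on int8 hardware instead of fp16 units
    xq, x_scale = quantize_int8(x)
    acc = xq.astype(np.int32) @ wq.astype(np.int32)   # int8 x int8, int32 accumulate
    y = acc.astype(np.float32) * (x_scale * w_scale)  # dequantize the result

    print(np.abs(y - x @ w).max())  # small error vs. the fp32 reference

Quantizing both operands is what unlocks the dedicated INT8 paths; quantizing only the weights still forces the multiply itself to run in fp16.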

“In W8A8 mode, operator speed improves 1.4x to 1.9x over native MLX, depending on batch size.”

Take Qwen3‑8B: native FP16 prefill runs at 1,695 tokens/s. With Cider's W8A8, it jumps to 2,531 tokens/s, a 1.5x boost. Llama3‑8B goes from 1,727 to 2,520 tokens/s. For vision‑language models like Qwen3‑VL‑2B, chunked prefill gets a 57‑61% end‑to‑end improvement. The kicker? You integrate it with one line of Python: call convert_model(model), and it automatically switches between W8A8 for long sequences and INT8 matrix‑vector kernels for single‑token decode. Dead simple.
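For context, here's what that looks like in an mlx‑lm script. The convert_model(model) call is from their description above; the import path is my assumption, so check Cider's README for the real module name:

    from mlx_lm import load, generate
    from cider import convert_model  # assumed import path; see Cider's README

    model, tokenizer = load("Qwen/Qwen3-8B")  # any MLX-supported model

    convert_model(model)  # the one line: swaps in INT8-capable kernels

    # Inference code is unchanged from here on; Cider picks W8A8 for long
    # prefill sequences and INT8 matrix-vector kernels for one-token decode.
    print(generate(model, tokenizer, prompt="Summarize activation quantization.",
                   max_tokens=100))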

What can you actually do with Mano‑P today? The examples they show are mind‑bending.

One scenario: fully automated application building. You describe the app you want in natural language. The system does requirement clarification, architecture design, code generation, and local deployment. Then it runs multi‑layer testing: first API endpoints, then LLM‑based visual checks on the UI, and finally end‑to‑end GUI automation tests using the VLA model itself. If something fails, the agent auto‑locates the bug, fixes the code, redeploys, and loops until everything passes. All without human hands.
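The control flow they're describing is a closed loop: build, deploy, test in layers, repair, repeat. Here's a compressed sketch, with every helper stubbed out as a hypothetical stand‑in for a real pipeline stage:

    # All of these are illustrative stubs, not a real API.
    def generate_code(spec): return "app-v1"   # clarification + architecture + codegen
    def deploy_locally(code): pass
    def run_api_tests(): return []        # layer 1: do the endpoints respond?
    def run_visual_checks(): return []    # layer 2: an LLM judges UI screenshots
    def run_gui_agent_tests(): return []  # layer 3: the VLA model drives the real UI
    def locate_and_fix(code, failures): return code

    def build_app(spec, max_rounds=10):
        code = generate_code(spec)
        deploy_locally(code)
        for _ in range(max_rounds):
            failures = run_api_tests() + run_visual_checks() + run_gui_agent_tests()
            if not failures:
                return code              # every layer green: done
            code = locate_and_fix(code, failures)
            deploy_locally(code)         # redeploy and try again

Layer 3 is where Mano‑P earns its keep: it's the step that would otherwise burn cloud tokens on every screenshot.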

Another example: a commercial video intelligence system that takes a single command, then autonomously generates, uploads, analyzes, edits, and re‑evaluates videos. It manipulates websites and editing software directly, handles files, tweaks subtitles, and spits out a report with both subjective scores and objective metrics.

The common thread: every screenshot, every click, every piece of interface data stays local. No cloud uploads. This matters because in a fully automated programming pipeline, GUI testing consumes over 59% of all cloud tokens. API tests can tell you if endpoints respond, but to really know if the software works, you need someone (or something) to open the UI and poke around. That process is inherently multimodal—constant screenshots, element location, action execution, result judgment. Mano‑P zeroes out that cost completely: no API bills, no data leakage.

Compare that to Claude Computer Use. In the OSWorld benchmark, Claude scores higher overall (72.1% vs. 58.2% for Mano‑P). But Claude requires cloud API calls; your screenshots and task data leave your machine. If you’re working with sensitive internal systems, patient data, or proprietary business workflows, the local approach isn’t just nice—it’s the only realistic option.

The team behind this (Mininglamp AI) plans to open‑source the training methodology for Mano‑P soon, so developers can fine‑tune their own custom GUI agents on private data. That's the big picture: with Mano‑P and Cider, they're building the foundational infrastructure for private, local AI that doesn't need a cloud subscription to be useful.

I've been running Mano‑P on my M2 MacBook Air (yes, the one with 8GB of RAM; it works, though it barely fits in memory). The first time I asked it to "open a terminal and run 'date'," my jaw dropped. It actually did it. Slowly, but it did it. With Cider's acceleration and future refinements, this could become the standard way to automate desktops without sacrificing privacy.

If you want to try it yourself:

    brew tap HanningWang/tap && brew install mano-cua

Then run a task:

    mano-cua run "Tell Alice the meeting is postponed to tomorrow 3pm"

Or integrate it as a Skill. The source code is on GitHub:

  • Mano-P: https://github.com/Mininglamp-AI/Mano-P
  • Cider: https://github.com/Mininglamp-AI/cider