Deploying OpenAI’s GPT-OSS Locally with Ollama
OpenAI’s new GPT-OSS-120B and GPT-OSS-20B models push the frontier of open-source AI, delivering strong real-world performance at low cost. The 120B model achieves near-parity with OpenAI’s proprietary o4-mini model on core reasoning benchmarks, while the 20B model performs similarly to o3-mini – all without needing cloud servers. Importantly, these models are “open-weight,” meaning you can download the weights and run them locally on your own hardware. In this tutorial, we’ll walk through deploying GPT-OSS on a local machine using Ollama, a handy tool for running large language models (LLMs) offline.
Model Sizes and Hardware Requirements
GPT-OSS comes in two sizes: gpt-oss-20b (20 billion parameters) and gpt-oss-120b (120 billion parameters). OpenAI quantized both with MXFP4 (roughly 4.25 bits per parameter) to drastically reduce their memory footprint. Thanks to this, the 20B model can run on systems with roughly 16 GB of memory, and the 120B model fits in about 80 GB. In practice, OpenAI recommends ~16 GB of VRAM (or unified memory) for the 20B model – a good match for higher-end consumer GPUs or Apple Silicon Macs – and a single 80 GB GPU (or roughly 60–80 GB of unified memory) for the 120B model.
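Not sure what your machine has? A quick way to check before downloading, using standard system utilities (none of these are Ollama-specific):
free -h                                             # Linux: total and available system RAM
nvidia-smi --query-gpu=memory.total --format=csv    # Linux/Windows: VRAM per NVIDIA GPU
sysctl -n hw.memsize                                # macOS: unified memory, in bytes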
Note: Apple’s M-series Macs are excellent for local LLMs because their unified memory means the GPU can utilize the full system RAM. For example, a MacBook with 32 GB unified memory can comfortably run the 20B model, and a Mac Studio with 64–128 GB might even handle the 120B model. On Windows/Linux PCs, a high-VRAM GPU (e.g. 24 GB RTX 4090) can run the 20B model, whereas the 120B model would require an 80 GB A100 or splitting across multiple GPUs (or falling back to CPU with very large RAM, albeit much slower).
Installing Ollama
Ollama is a free, open-source runtime that makes it easy to download and run LLMs locally. It supports macOS, Windows, and Linux. To set it up:
- macOS: Download the Ollama app from the official website and run the installer. This installs the Ollama desktop app (which also includes a CLI tool).
- Windows: Download the Windows installer from Ollama’s site and follow the setup to install the Ollama runtime.
- Linux: Install via the one-line script. For example, on Ubuntu you can run:
curl -fsSL https://ollama.com/install.sh | sh
This script downloads and installs the Ollama CLI and server on your system.
Once installed, you can run ollama commands from your terminal. It’s a good idea to verify the installation by running ollama --version or simply ollama to see the available commands. You should see subcommands like ollama pull, ollama run, and ollama serve, which we’ll use shortly.
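For example (the exact version string varies by release):
ollama --version   # prints the installed version, e.g. "ollama version is 0.x.x"
ollama             # with no arguments, prints the list of available subcommands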
Downloading the GPT-OSS Models
With Ollama set up, the next step is to download the GPT-OSS model weights. OpenAI has made both the 20B and 120B models freely available for download. You can obtain them through Ollama’s built-in model registry. There are two ways to get the models: pulling in advance or letting Ollama fetch them on first run.
1. Pull the models explicitly (optional): Ollama allows you to pull a model by name. This will download the weights so they’re ready for use. In a terminal, run:
ollama pull gpt-oss:20b # Download the 20B model (~13–14 GB download)
ollama pull gpt-oss:120b # Download the 120B model (~65 GB download)
You’ll see progress bars as each model file downloads and unpacks. Once done, you can confirm by listing installed models:
ollama list
This should show entries for gpt-oss:20b and gpt-oss:120b with their sizes (approximately 13 GB for the 20B and 65 GB for the 120B in quantized form).
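Sample output (the ID column and timestamps will differ on your machine; sizes are approximate):
NAME            ID              SIZE      MODIFIED
gpt-oss:20b     <model id>      13 GB     2 minutes ago
gpt-oss:120b    <model id>      65 GB     a few minutes ago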
2. Let ollama run auto-download: You can also skip the manual pull – Ollama will automatically fetch a model the first time you run it. For example, if you directly execute ollama run gpt-oss:20b, it will detect that the model isn’t present and download it for you. This one-step approach is convenient if you just want to jump straight into using the model.
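ollama run also accepts a prompt as an argument for a quick one-shot answer instead of an interactive session (the prompt text here is just an example):
ollama run gpt-oss:20b "Summarize the plot of Hamlet in two sentences."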
💡 Tip: The 20B model is much smaller and faster to download, so you might start with that to verify everything works. The 120B model is huge; ensure you have sufficient disk space and patience (it’s tens of GBs) before pulling it. The Apache 2.0 license means you’re free to use and even fine-tune these weights in your own projects.
Running GPT-OSS with Ollama (CLI Usage)
Now for the fun part – running the model and chatting with it! Ollama can either run models on-demand in your terminal or host them as a local service. We’ll start with the simple interactive CLI usage.
1. Launch an interactive session: In your terminal, run the 20B model by executing:
ollama run gpt-oss:20b
After a moment (as the model loads), you should see a >>> prompt, indicating the model is ready for input. At this point, you can type a question or prompt for GPT-OSS to answer. For example, you might ask it to solve a creative riddle or summarize a document. After you hit enter, the model will print “Thinking…” as it processes your request, then output a detailed response.
Example: After running ollama run gpt-oss:20b, a session might look like:
>>> Explain the significance of the moon landing in a poetic tone.
Thinking…
GPT-OSS: “The moon landing marked a giant leap for all humankind, a night where dreams left footprints on lunar soil…” (and so on, in a rich, poetic explanation.)
The first response may take a while (especially if you’re running on CPU only, or if your GPU memory is borderline), but subsequent queries will be faster once the model is loaded. Despite its smaller size, GPT-OSS-20B already demonstrates strong reasoning and eloquence, thanks to OpenAI’s fine-tuning. For more challenging queries (complex reasoning, code execution, etc.), the 120B model will produce even stronger results – albeit with higher memory and compute demands.
2. Try the 120B model (if you have the resources): If your system meets the requirements for the larger model, you can similarly start it with:
ollama run gpt-oss:120b
You’ll enter an interactive prompt again. The gpt-oss-120b model is designed for “frontier” performance – it can follow complex instructions, perform chain-of-thought reasoning, and even handle tool use (e.g. making web requests or running code) in an agentic manner. OpenAI notes that gpt-oss-120b achieves near-parity with its proprietary o4-mini model, yet it can run on a single high-end GPU or advanced workstation. If you try a complex prompt (say, a multi-step problem or a request to use tools), you’ll see it reason through the steps thanks to the model’s chain-of-thought output.
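OpenAI also describes an adjustable reasoning effort (low/medium/high) for GPT-OSS, set via the system prompt. In Ollama’s interactive REPL you can experiment with this using the built-in /set system command – a sketch, assuming the “Reasoning: high” convention from OpenAI’s documentation:
>>> /set system "Reasoning: high"
>>> Plan a three-step experiment to measure the speed of sound with household items.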
3. Exiting: To exit the interactive chat, type /bye at the >>> prompt (Ctrl+D also works). Ctrl+C interrupts a response that is still being generated.
4. Using ollama serve (optional): If you want the model to stay loaded and be accessible for multiple queries or from other applications, consider running ollama serve. This command starts the Ollama server, which by default listens on a localhost port (localhost:11434). Once it’s running, you can still chat via the CLI (ollama run will connect to the server), but more importantly you can point other tools or APIs at this server to use GPT-OSS.
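With the server up, any HTTP client can query the model. A minimal sketch using Ollama’s REST API (the /api/generate endpoint; the prompt is just an example):
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
Ollama also exposes an OpenAI-compatible endpoint at /v1/chat/completions, so many existing OpenAI SDK integrations can simply be pointed at http://localhost:11434/v1 with the model name set to gpt-oss:20b.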
Using a Chat UI for Better UX
Interacting through the terminal is straightforward, but a graphical chat interface can greatly improve the user experience. Fortunately, there are open-source chat UIs that can hook into your local Ollama instance. One such example is LobeChat – an elegant, modern chat UI that supports multiple AI backends, including local Ollama models.
- LobeChat: This is an open-source chat application that lets you chat with various AI models through a nice UI. It natively supports connecting to an Ollama server, which means once you have ollama serve running GPT-OSS on your machine, you can use LobeChat as the front-end. In LobeChat’s settings, you would choose Ollama as the provider, and it will use your local GPT-OSS model for conversation. The interface offers chat history, prompt templates, and other convenient features that a terminal can’t provide. (LobeChat even supports speech synthesis, multi-modal inputs, and plugins, offering a ChatGPT-like experience entirely locally.)
- Other UI options: The ecosystem of local LLM UIs is growing. For instance, Open WebUI (a web-based interface originally built for Ollama) or projects like Text Generation WebUI can also connect to local models. Some community tools are cross-platform desktop apps that automatically detect Ollama models. While setting these up is beyond the scope of this tutorial, it’s good to know you’re not limited to the command line. A bit of configuration (see the example below) can give you a full chat application experience with GPT-OSS running on your hardware.
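One configuration detail worth knowing: browser-based UIs talk to Ollama over HTTP and can be blocked by CORS. Ollama’s server honors the OLLAMA_ORIGINS and OLLAMA_HOST environment variables to control access, for example:
OLLAMA_ORIGINS="*" ollama serve    # allow requests from any web origin (loosen with care)
OLLAMA_HOST=0.0.0.0 ollama serve   # listen on all interfaces so other devices on your LAN can connect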
Using a chat UI doesn’t change anything about how the model runs – it’s still all local and private – but it makes interacting with the AI more intuitive (buttons, text boxes, conversation threads, etc.). Whether you use the terminal or a UI, GPT-OSS can now serve as your personal AI assistant without any cloud dependency.
Conclusion
In this post, we introduced GPT-OSS, OpenAI’s latest open-weight models, and demonstrated how to deploy them locally using Ollama. In summary, you installed the Ollama runtime, downloaded the GPT-OSS-20B (and optionally 120B) model, and ran it on your device – effectively turning your computer into a ChatGPT-like service. We also discussed how different model sizes have different hardware needs (20B being accessible to enthusiasts with a decent PC or Mac, and 120B requiring more advanced memory/GPU capacity). Finally, we touched on using a chat UI like LobeChat to interact with the model in a more user-friendly way.
GPT-OSS opens a new era of local AI development – you can experiment with a powerful language model on your own machine, fine-tune it to your domain, or integrate it into applications, all without relying on an external API. Best of all, because it’s open-weight and Apache-licensed, developers and researchers are free to build on it and share improvements. With tools like Ollama simplifying deployment, running a cutting-edge 120B-parameter model at home is no longer science fiction – it’s a tutorial away. Happy hacking with GPT-OSS!
Sources: The details and commands above were based on OpenAI's official GPT-OSS announcement, Ollama's documentation, and community guides. Enjoy your journey with local LLMs!