Ollama is a free, open-source tool that lets you download and run large language models like Llama 3, Mistral, Gemma, and Phi locally on your own computer without sending any data to external servers. You install it with one command, pull a model with a second command, and start chatting with a local AI in under five minutes. No API keys, no subscriptions, no cloud dependency, and complete privacy for every conversation.
Running AI locally was reserved for machine learning engineers with custom Python environments until Ollama simplified the entire process into something closer to installing a regular app. The project crossed 100,000 GitHub stars in 2025 and became the default way the self-hosting community runs local LLMs. Whether you want a private ChatGPT replacement, a coding assistant that works offline, or an AI engine for your home automation system, Ollama is the starting point. Here is the complete setup for Windows, Mac, and Linux, plus how to add a web interface and connect it to other tools.
What Ollama Actually Does
Ollama is a model runtime, not a model itself. It handles the complex work of downloading quantized model files, loading them into memory (RAM or VRAM), managing the inference server, and exposing a local API that other applications can talk to. Think of Ollama as Docker for AI models: it pulls models from a registry, runs them in an optimized environment, and exposes them through a standardized interface.
The models themselves are open-weight large language models published by Meta (Llama 3, Llama 3.1), Mistral AI (Mistral, Mixtral), Google (Gemma 2), Microsoft (Phi-3), and others. These models range from 1 billion to 70+ billion parameters. Smaller models (7B-8B parameters) run on consumer hardware with 8GB RAM. Larger models (70B+) need 32GB+ RAM or a dedicated GPU with sufficient VRAM.
Every conversation stays on your machine. No data leaves your local network. No tokens are counted against any API quota. You can ask the model anything, including proprietary business questions, personal health queries, or creative writing prompts, with zero privacy risk. The model runs entirely on your CPU or GPU, processing tokens at speeds that depend on your hardware rather than a cloud provider’s server load.
System Requirements for Running Local AI
Ollama runs on Windows 10/11, macOS 12+, and Linux. The hardware requirements depend entirely on which model you want to run. Smaller models work on surprisingly modest hardware, while frontier-class models need serious compute.
For 7B-8B parameter models (Llama 3.1 8B, Mistral 7B, Gemma 2 9B): minimum 8GB RAM, any modern CPU (Intel i5/AMD Ryzen 5 from 2018 onward). These models generate 5 to 15 tokens per second on CPU-only systems, which feels like watching someone type quickly. A dedicated GPU (NVIDIA RTX 3060 or better with 8GB+ VRAM) increases speed to 30 to 60 tokens per second, which feels near-instant.
For 13B-14B parameter models (Llama 2 13B, Phi-3 14B): minimum 16GB RAM. CPU inference is noticeably slower (3 to 8 tokens/second). A GPU with 12GB+ VRAM (RTX 3060 12GB, RTX 4070) keeps responses fast. These models offer noticeably better reasoning and writing quality than 7B models.
For 70B-class models (Llama 3.1 70B, Mixtral 8x7B): minimum 48GB RAM for CPU inference (very slow, 1 to 3 tokens/second) or 40GB+ of VRAM, which in practice means two 24GB cards (dual RTX 3090s or 4090s) or a data center GPU like the A100. Most home users skip 70B models unless they have workstation-class hardware. The quality jump from 13B to 70B is substantial for complex reasoning tasks.
Apple Silicon Macs (M1, M2, M3, M4) are exceptionally good for local AI because they share unified memory between CPU and GPU. An M2 MacBook Air with 16GB unified memory runs 7B models at 20 to 30 tokens/second and 13B models at 10 to 15 tokens/second, matching or exceeding dedicated NVIDIA GPUs in the same price range for these workloads.
Installing Ollama on Any Operating System
macOS: Download the Ollama installer from ollama.com and run it like any Mac application. The installer adds the ollama command to your terminal and starts the Ollama server as a background service. After installation, open Terminal and run the command to pull your first model.
Windows: Download the Windows installer from ollama.com. Run the .exe file, follow the installation wizard, and Ollama installs as a Windows service that starts automatically. Open PowerShell or Command Prompt and the ollama command is available immediately. Ollama on Windows supports NVIDIA GPU acceleration out of the box if you have current NVIDIA drivers installed.
Linux: Run the one-line install script from your terminal. The script detects your distribution (Ubuntu, Debian, Fedora, Arch), installs dependencies, downloads the Ollama binary, and configures it as a systemd service. NVIDIA GPU support on a native install only requires current NVIDIA drivers; the NVIDIA Container Toolkit is needed when you run Ollama inside Docker. AMD GPU support (ROCm) is available for supported Radeon cards.
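The install script is the one published on ollama.com; as with any piped installer, you can download and inspect it first if you prefer:

```
# Official one-line installer for Linux
curl -fsSL https://ollama.com/install.sh | sh

# Confirm the background service is running and the CLI is on your PATH
systemctl status ollama
ollama --version
```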
Docker: Ollama publishes an official Docker image. Run it with a Docker command that maps the models directory and exposes port 11434. For Docker Compose setups, add an Ollama service to your existing compose file. The Docker approach is ideal for headless servers and NAS devices where you want Ollama integrated with your existing container stack.
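A minimal sketch of the Docker approach using the official ollama/ollama image (the volume and container names are just examples; add GPU passthrough only if the NVIDIA Container Toolkit is installed):

```
# CPU-only: persist models in a named volume and expose the API port
docker run -d --name ollama \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

# With NVIDIA GPU passthrough (requires the NVIDIA Container Toolkit)
docker run -d --name ollama --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

# Pull and chat with a model inside the container
docker exec -it ollama ollama run llama3.1
```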
Downloading and Running Your First Model
Ollama’s model library at ollama.com/library lists every available model with descriptions, parameter counts, and hardware recommendations. After installing Ollama, pulling and running a model takes one terminal command.
Start with Llama 3.1 8B, the best all-around model for most hardware. Run the pull command in your terminal specifying llama3.1 as the model name. Ollama downloads the quantized model file (approximately 4.7GB for the default Q4_0 quantization). After the download completes, run the model to start an interactive chat session directly in your terminal.
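In practice that is two commands (swap in any other model name from the library):

```
ollama pull llama3.1    # downloads the default ~4.7GB quantized build
ollama run llama3.1     # loads the model and opens an interactive chat
```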
Type any question or prompt, and the model responds. The first response takes 2 to 5 seconds as the model loads into memory (subsequent responses start faster). Try asking it to explain a concept, write code, summarize text, or answer a question. Exit the session by typing /bye.
Recommended models to try next: Mistral 7B (excellent at instruction following, slightly more concise than Llama), Gemma 2 9B (Google’s model, strong at reasoning and factual accuracy), CodeLlama 7B (specialized for code generation and debugging), and Phi-3 3.8B (Microsoft’s small model that runs fast on anything and is surprisingly capable for its size).
Adding a Web Interface With Open WebUI
Ollama’s terminal interface works but lacks the polish of ChatGPT’s web interface. Open WebUI (formerly Ollama-WebUI) provides a browser-based chat interface that looks and feels like ChatGPT, complete with conversation history, model switching, file uploads, and multi-user support. It connects directly to your local Ollama instance.
Deploy Open WebUI with Docker using a single command that links it to your Ollama server. Open WebUI runs on port 3000 by default and is accessible from any browser on your network. Create an admin account on first login, then start chatting with any model you have pulled in Ollama.
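One common invocation, close to the project's documented quick start, looks like this; the host.docker.internal mapping lets the container reach an Ollama server running natively on the host (you can drop it when both services share a compose network, as in the combined setup below):

```
docker run -d --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```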
Open WebUI features that make it a genuine ChatGPT alternative: conversation history persists across sessions, you can switch between models mid-conversation to compare responses, file upload lets you analyze documents locally, the playground mode allows adjusting temperature and other generation parameters, and multi-user mode lets family members have separate accounts with their own conversation histories.
For the cleanest setup, run both Ollama and Open WebUI in a single Docker Compose file. Define both services, link them through Docker networking, and bring up the entire local AI stack with one command. The compose file also makes updates trivial: pull new images and restart the containers.
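A minimal sketch of such a compose file, written out from the shell; the service names, volume names, and ports are conventional choices you can adapt, and OLLAMA_BASE_URL is the variable Open WebUI reads to find the Ollama server:

```
# Write a combined stack definition, then start it
cat > docker-compose.yml <<'EOF'
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "3000:8080"
    depends_on:
      - ollama
    restart: unless-stopped
volumes:
  ollama:
  open-webui:
EOF

docker compose up -d                           # bring up the whole stack
docker compose pull && docker compose up -d    # later: update both images
```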
Useful Ollama Commands and Configuration
The ollama command-line tool provides everything you need to manage models and configure the server.
List installed models with the list command to see every model on your system with its size and last modified date. Remove models you no longer use with the rm command followed by the model name to free disk space. Show detailed model information (parameter count, quantization level, template format) with the show command.
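For reference, those day-to-day commands look like this (model names are examples):

```
ollama list              # installed models with sizes and modified dates
ollama show llama3.1     # parameter count, quantization, template, license
ollama rm mistral        # delete a model you no longer need
ollama pull llama3.1     # re-pull a model to pick up an updated build
```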
Create custom models using a Modelfile, which is similar to a Dockerfile. A Modelfile specifies a base model, a custom system prompt, adjusted temperature settings, and stop tokens. This lets you create specialized AI assistants: a coding helper with a system prompt focused on Python, a writing editor with instructions to improve prose, or a home automation advisor that understands your specific smart home setup.
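A small sketch of that workflow from the shell, building a hypothetical Python-focused assistant; FROM, SYSTEM, and PARAMETER are standard Modelfile instructions, while the model name, prompt, and stop token are made up for illustration:

```
# Write a Modelfile that layers a system prompt and settings on a base model
cat > Modelfile <<'EOF'
FROM llama3.1
SYSTEM """You are a concise Python coding assistant. Prefer standard-library
solutions and explain trade-offs briefly."""
PARAMETER temperature 0.3
# stop token shown for illustration; match it to the base model's template
PARAMETER stop "<|user|>"
EOF

# Build the custom model and chat with it like any other
ollama create python-helper -f Modelfile
ollama run python-helper
```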
The Ollama API listens on localhost port 11434 and, alongside its native endpoints, exposes an OpenAI-compatible API. Any application that supports the OpenAI API can connect to Ollama by changing the base URL from api.openai.com to localhost:11434/v1. This compatibility means Ollama works with hundreds of existing AI tools, browser extensions, IDE plugins, and automation platforms without any modification.
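For example, a plain curl call against the OpenAI-style chat completions endpoint; no API key is needed, though some clients insist on a dummy one:

```
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [
      {"role": "user", "content": "Explain unified memory in one paragraph."}
    ]
  }'
```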
Connecting Ollama to Home Assistant and Other Tools
Ollama’s OpenAI-compatible API makes it a drop-in replacement for cloud AI services in many applications. Several self-hosted tools connect to Ollama directly.
Home Assistant supports Ollama through its official Ollama integration, which registers a conversation agent for the Assist pipeline. Configure Home Assistant to use your local Ollama instance as its AI backend for voice commands and text-based interactions. Your smart home commands are processed entirely on your local network, with zero cloud dependency for AI interpretation.
Continue (VS Code and JetBrains extension) connects to Ollama for local AI code completion and chat. Write code with AI assistance that runs on your own hardware, keeping proprietary code private. CodeLlama and DeepSeek Coder models work particularly well for this use case.
n8n and Node-RED workflow automation platforms can call Ollama’s API to add AI processing to automated workflows: summarize emails, categorize incoming documents, generate responses, or analyze data. All processing happens locally, which is critical for workflows handling sensitive business or personal data.
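From a workflow node this is usually just an HTTP request against Ollama's native generate endpoint; a hedged sketch of a summarization call (the prompt text and the non-streaming flag are illustrative):

```
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.1",
    "prompt": "Summarize the following email in two sentences: ...",
    "stream": false
  }'
```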
Performance Optimization Tips
Getting the fastest possible inference speed from Ollama requires matching your model choice to your hardware capabilities and tuning a few configuration options.
Use quantized models appropriate for your RAM. Q4_0 quantization (default) provides the best balance of quality and speed for most hardware. Q5_K_M and Q6_K offer slightly better quality at the cost of larger file size and slower inference. Q2_K and Q3_K reduce quality noticeably but allow larger models to fit in limited RAM.
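Each model's library page lists its available quantization tags; the exact tag names vary by model, but pulling a specific quantization looks roughly like this (the tag names below are assumed examples for llama3.1, so check the Tags tab before copying them):

```
ollama pull llama3.1                         # default Q4_0 build, ~4.7GB
ollama pull llama3.1:8b-instruct-q5_K_M      # higher quality, larger file
ollama pull llama3.1:8b-instruct-q3_K_M      # fits tighter RAM, lower quality
```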
On NVIDIA GPUs, verify that Ollama detects your GPU by checking the server log output when a model loads. The log should indicate CUDA acceleration and show your GPU model. If Ollama falls back to CPU inference despite having an NVIDIA GPU, update your NVIDIA drivers to the latest version; Ollama bundles the CUDA libraries it needs, so a current driver is normally the only requirement.
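A quick way to confirm where a loaded model is actually running is ollama ps, which on recent Ollama versions reports whether the model is resident on the GPU or the CPU:

```
ollama run llama3.1 "hello"   # run a single prompt to load the model
ollama ps                     # PROCESSOR column shows e.g. "100% GPU" or "100% CPU"
```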
On Apple Silicon, Ollama automatically uses the Metal GPU framework for acceleration. Performance scales with unified memory size: 8GB handles 7B models comfortably, 16GB handles 13B models well, 32GB runs some 30B models, and 64GB+ handles 70B models. The M3 Pro/Max and M4 Pro/Max chips with higher memory bandwidth deliver notably faster token generation than base M-series chips.
Is Ollama free to use?
Ollama is 100 percent free and open-source under the MIT license. There are no subscription fees, usage limits, or premium tiers. The AI models it runs (Llama 3, Mistral, Gemma) are also free to download and use. The only cost is the electricity to run your computer, which is negligible for casual use.
Can Ollama replace ChatGPT?
For many tasks, yes. Local 8B and 13B models handle conversation, writing assistance, code generation, and question answering competently. They fall short of GPT-4 class models on complex multi-step reasoning and very long context tasks. For privacy-sensitive work, coding assistance, and general chat, local models through Ollama are a practical ChatGPT replacement.
How much disk space do AI models need?
Model sizes depend on parameter count and quantization. Typical sizes: 7B-8B models need 4 to 5GB, 13B models need 7 to 8GB, 34B models need 18 to 20GB, and 70B models need 38 to 40GB. You can install multiple models simultaneously since Ollama only loads the active model into RAM. Store models on an SSD for faster loading times.
Does Ollama work without internet?
Yes, after downloading your models. Ollama only needs internet to pull model files from the registry. Once downloaded, models run entirely offline with zero network dependency. This makes Ollama ideal for air-gapped environments, travel without WiFi, and privacy-critical applications where network isolation is required.
Which Ollama model is best for coding?
CodeLlama 13B and DeepSeek Coder V2 16B are the strongest coding models that run on consumer hardware. CodeLlama handles Python, JavaScript, C++, and general programming well. DeepSeek Coder excels at code completion, bug fixing, and explaining code. For lighter hardware, Phi-3 3.8B provides surprisingly good coding assistance at a fraction of the resource cost.