Tutorials · 17 min read · Updated January 11, 2026

AI Voice Assistant: Build Your Own Jarvis (Python Tutorial 2026)

Build a fully functional AI voice assistant with Python. Step-by-step tutorial covering speech recognition, GPT-5 integration, and text-to-speech for your own Jarvis.

voice-assistant · python · speech-recognition · text-to-speech · openai · tutorial · jarvis

Growing up watching Iron Man, I was fascinated by Jarvis. An AI assistant that could hear you, understand complex requests, and respond naturally seemed like pure science fiction.

Last weekend, I built my own version in about 200 lines of Python. It’s not quite running my hypothetical mansion, but it holds real conversations, answers questions, and even tells jokes when I ask. The technology that seemed impossible a decade ago is now genuinely accessible.

Let me show you exactly how to build this. By the end of this tutorial, you’ll have a working voice assistant that listens to your voice, thinks using GPT-5, and speaks responses back to you.

What Makes a Voice Assistant “Intelligent”?

Before we write any code, let’s understand what we’re building. A modern voice assistant has three core components, each handling a different part of the interaction.

The Three Core Components

1. Speech Recognition (Hearing): Converting your spoken words into text. This used to require expensive specialized hardware and software. Now, APIs like OpenAI’s Whisper do it with remarkable accuracy.

2. Language Understanding (Thinking): Making sense of what you said and formulating a response. This is where GPT-5 shines. Unlike old-school voice assistants that matched keywords, GPT-5 genuinely understands context, nuance, and even ambiguity.

3. Text-to-Speech (Speaking): Converting text responses back into natural-sounding speech. Modern TTS has moved far beyond robotic voices; you can get speech that sounds genuinely human.

2026 vs Earlier Approaches

The voice assistants of 2020-2023 were mostly sophisticated pattern matchers. “Set a timer for 10 minutes” worked because engineers programmed that exact pattern.

Our 2026 approach is fundamentally different. GPT-5 doesn’t need pre-programmed patterns—it understands language. You can ask follow-up questions, change topics mid-conversation, or phrase things in unusual ways. The AI adapts.

This means our Jarvis can handle questions its creators never anticipated. That’s the magic of building on large language models.

Prerequisites and Setup

Let’s get your development environment ready.

System Requirements

  • Python 3.10 or newer (I’m using 3.12)
  • A working microphone (built-in laptop mic works fine)
  • Speakers or headphones (for hearing responses)
  • OpenAI API key (required for GPT-5 and Whisper)
  • Internet connection (for cloud APIs)

Optional but helpful:

  • Decent quality microphone for better recognition
  • ElevenLabs API key for premium voice quality

Installing Required Libraries

Create a new project directory and set up a virtual environment:

mkdir jarvis
cd jarvis
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install the core dependencies:

pip install openai speechrecognition pyaudio pyttsx3

Note on PyAudio: This library can be tricky to install depending on your system.

On macOS:

brew install portaudio
pip install pyaudio

On Windows: PyAudio often installs directly. If not, download the appropriate .whl file from Christoph Gohlke’s repository.

On Linux (Ubuntu/Debian):

sudo apt-get install python3-pyaudio portaudio19-dev
pip install pyaudio

Getting Your API Keys

You’ll need an OpenAI API key at minimum:

  1. Go to platform.openai.com
  2. Sign in or create an account
  3. Generate a new API key
  4. Copy it somewhere safe

Create a .env file in your project:

OPENAI_API_KEY=your-key-here

And install python-dotenv to load it:

pip install python-dotenv
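
To verify the key actually loads, a quick sanity check like this works (it only reports whether the variable is set, never the key itself):

import os
from dotenv import load_dotenv

load_dotenv()  # Reads OPENAI_API_KEY from .env into the environment
print("Key loaded:", bool(os.getenv("OPENAI_API_KEY")))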

Understanding the Architecture

Before diving into code, let’s visualize what we’re building.

The Voice Loop

Our assistant runs in a continuous loop:

1. LISTEN    → Capture audio from microphone
2. TRANSCRIBE → Convert speech to text (Whisper)
3. THINK     → Generate response (GPT-5)
4. SPEAK     → Convert text to audio (TTS)
5. REPEAT    → Go back to step 1

This loop continues until you say a shutdown phrase like “goodbye” or “exit.”
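
In code, the whole loop fits in a handful of lines wrapping three functions we’ll build in the next sections (the names here mirror the ones used in the final script):

# Skeleton of the voice loop; listen(), think(), and speak() are built in Steps 1-3
while True:
    text = listen()                   # Steps 1-2: capture audio and transcribe it
    if not text:
        continue
    if "goodbye" in text.lower():     # Shutdown phrase ends the session
        break
    reply = think(text)               # Step 3: generate a response with GPT-5
    speak(reply)                      # Step 4: play it back, then loop again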

Component Connections

Here’s how the parts connect:

[Microphone] → [Speech Recognition] → [GPT-5] → [Text-to-Speech] → [Speaker]
     ↑                                                                  |
     └──────────────────────────────────────────────────────────────────┘
                              (Loop continues)

Each component is independent, which means you can swap parts. Don’t like OpenAI TTS? Use ElevenLabs. Want privacy? Replace GPT-5 with a local model via Ollama.

Cloud vs Local Processing

Cloud approach (what we’ll build first):

  • OpenAI Whisper for transcription
  • GPT-5 for intelligence
  • OpenAI TTS for speech
  • Requires internet, but best quality

Local approach (we’ll cover this too):

  • Whisper running locally or alternative STT
  • Ollama with Llama 4 for intelligence
  • pyttsx3 for speech
  • Works offline, completely private

I recommend starting with cloud—it’s easier to get working and demonstrates the full capability. Once that works, you can swap components for local alternatives.

Step 1: Speech Recognition

Let’s start by getting your computer to hear and transcribe speech.

Setting Up the Microphone

First, create a simple script to test your microphone:

import speech_recognition as sr

# Initialize recognizer
recognizer = sr.Recognizer()

# List available microphones
for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print(f"{index}: {name}")

Run this to see your available audio input devices. Note the index of your preferred microphone.
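
If you’d rather use a specific device than the system default, pass its index when opening the microphone (swap the 2 below for the index you noted):

import speech_recognition as sr

recognizer = sr.Recognizer()

# Open a specific input device instead of the default microphone
with sr.Microphone(device_index=2) as source:
    recognizer.adjust_for_ambient_noise(source, duration=0.5)
    audio = recognizer.listen(source, timeout=5)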

Basic Speech-to-Text

Here’s a simple function to capture and transcribe speech:

import speech_recognition as sr

def listen():
    """Listen for speech and return transcribed text."""
    recognizer = sr.Recognizer()
    
    with sr.Microphone() as source:
        print("Listening...")
        
        # Adjust for ambient noise
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        
        try:
            # Listen for speech (with timeout)
            audio = recognizer.listen(source, timeout=5, phrase_time_limit=15)
            print("Processing...")
            
            # Use Google's free speech recognition
            text = recognizer.recognize_google(audio)
            return text
            
        except sr.WaitTimeoutError:
            return None
        except sr.UnknownValueError:
            print("Couldn't understand that.")
            return None
        except sr.RequestError as e:
            print(f"Speech recognition error: {e}")
            return None

# Test it
if __name__ == "__main__":
    while True:
        result = listen()
        if result:
            print(f"You said: {result}")

This uses Google’s free speech recognition service, which works reasonably well for testing. But we’ll upgrade to Whisper for production quality.

Using OpenAI Whisper API

OpenAI’s Whisper is significantly more accurate than free alternatives. Here’s how to use it:

import speech_recognition as sr
from openai import OpenAI
from dotenv import load_dotenv
import io

load_dotenv()  # Load OPENAI_API_KEY from .env
client = OpenAI()  # Uses the key from the environment

def listen_whisper():
    """Listen and transcribe using OpenAI Whisper."""
    recognizer = sr.Recognizer()
    
    with sr.Microphone() as source:
        print("🎤 Listening...")
        recognizer.adjust_for_ambient_noise(source, duration=0.3)
        
        try:
            audio = recognizer.listen(source, timeout=5, phrase_time_limit=30)
            print("🔄 Transcribing...")
            
            # Convert to format Whisper expects
            wav_data = audio.get_wav_data()
            audio_file = io.BytesIO(wav_data)
            audio_file.name = "audio.wav"
            
            # Call Whisper API
            transcript = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                response_format="text"
            )
            
            return transcript.strip()
            
        except sr.WaitTimeoutError:
            return None
        except Exception as e:
            print(f"Error: {e}")
            return None

Whisper handles accents, background noise, and mumbling much better than alternatives. It’s what I use in my actual implementation.

Step 2: The AI Brain (OpenAI Integration)

Now let’s give our assistant intelligence.

Connecting to GPT-5

Here’s how to set up the conversation with GPT-5:

from openai import OpenAI

client = OpenAI()

# System prompt defines the assistant's personality
SYSTEM_PROMPT = """You are Jarvis, a helpful and intelligent personal assistant. 
You speak naturally and conversationally, like a knowledgeable friend.

Guidelines:
- Keep responses concise for voice interaction (2-3 sentences usually)
- Be helpful, friendly, and occasionally witty
- If you don't know something, say so honestly
- For complex questions, offer to elaborate if the user wants

You're having a voice conversation, so avoid bullet points or formatting.
Respond as if you're speaking out loud."""

def get_ai_response(user_input, conversation_history):
    """Get response from GPT-5."""
    
    # Add user message to history
    conversation_history.append({
        "role": "user",
        "content": user_input
    })
    
    # Build messages with system prompt
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(conversation_history[-10:])  # Last 10 messages (5 exchanges)
    
    response = client.chat.completions.create(
        model="gpt-5",
        messages=messages,
        max_tokens=300,  # Keep responses voice-friendly
        temperature=0.7
    )
    
    assistant_message = response.choices[0].message.content
    
    # Add assistant response to history
    conversation_history.append({
        "role": "assistant",
        "content": assistant_message
    })
    
    return assistant_message

Conversation Memory

Notice we’re maintaining conversation_history. This is crucial for natural conversation. Without it, every question would be isolated—the AI couldn’t remember what you said 30 seconds ago.

We keep the last 10 messages (5 exchanges) to balance memory with API costs and context limits.

# Conversation history persists across the loop
conversation_history = []

# Each interaction adds to history
response = get_ai_response("What's the capital of France?", conversation_history)
# conversation_history now has 2 messages

response = get_ai_response("What's its population?", conversation_history)
# AI knows "its" refers to Paris because it has context

Intent Understanding

One thing I love about GPT-5: you don’t need to code intent detection separately. Ask it to set a reminder, and it understands. Ask it a science question, and it switches modes naturally.

You can even extend this with function calling for structured actions:

# Example: Let GPT-5 decide when to perform actions
SYSTEM_PROMPT += """

When the user asks for:
- Current time: Include [TIME_REQUEST] in your response
- Weather: Include [WEATHER_REQUEST] in your response
- To exit: Include [EXIT] in your response

I'll handle these special requests automatically."""

This lets you detect when specific actions are needed without complex parsing.
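
On the Python side, that just means scanning the response for those tags before speaking it. Here’s a minimal sketch covering the time and exit tags (the weather case would follow the same pattern):

from datetime import datetime

def handle_tags(response_text):
    """Replace special tags from the model with real data before speaking."""
    if "[TIME_REQUEST]" in response_text:
        now = datetime.now().strftime("%I:%M %p")
        response_text = response_text.replace("[TIME_REQUEST]", now)
    if "[EXIT]" in response_text:
        return None  # Signal the main loop to shut down
    return response_text.strip()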

Step 3: Text-to-Speech

Now let’s make our assistant speak.

Offline Option: pyttsx3

For a quick, free, offline solution:

import pyttsx3

def speak_offline(text):
    """Speak text using local TTS engine."""
    engine = pyttsx3.init()
    
    # Optional: customize voice
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[0].id)  # Try different indices
    engine.setProperty('rate', 175)  # Speed (words per minute)
    
    engine.say(text)
    engine.runAndWait()

pyttsx3 works offline and responds instantly, but the voice quality is noticeably robotic. Fine for testing, but not for daily use.

Cloud Option: OpenAI TTS

OpenAI’s TTS sounds remarkably natural:

from openai import OpenAI
import os
import tempfile
from playsound import playsound

client = OpenAI()

def speak_openai(text):
    """Speak text using OpenAI TTS."""
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",  # Options: alloy, echo, fable, onyx, nova, shimmer
        input=text
    )
    
    # Save to temp file and play
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
        f.write(response.content)
        temp_path = f.name
    
    playsound(temp_path)
    
    # Clean up
    os.unlink(temp_path)

The alloy voice is my favorite—natural and clear. nova is warmer, onyx is deeper. Try them all.

For slightly faster playback, use tts-1. For higher quality, use tts-1-hd (slower but better for complex speech).

Premium Option: ElevenLabs

For the most human-like voices, ElevenLabs is unmatched:

import os
import requests
import tempfile
from playsound import playsound

ELEVENLABS_API_KEY = "your-key-here"

def speak_elevenlabs(text, voice_id="21m00Tcm4TlvDq8ikWAM"):
    """Speak text using ElevenLabs (premium quality)."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    
    headers = {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json"
    }
    
    data = {
        "text": text,
        "model_id": "eleven_monolingual_v1",
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.8
        }
    }
    
    response = requests.post(url, json=data, headers=headers)
    
    if response.status_code == 200:
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
            f.write(response.content)
            temp_path = f.name
        
        playsound(temp_path)
        os.unlink(temp_path)
    else:
        print(f"ElevenLabs error: {response.status_code} - {response.text}")

ElevenLabs charges per character, so it’s not ideal for heavy use. But for a personal assistant you use occasionally, it’s worth it.

Putting It All Together

Now let’s combine everything into a working assistant.

The Complete Voice Loop

Here’s the full implementation:

#!/usr/bin/env python3
"""
Jarvis - Personal AI Voice Assistant
Built with OpenAI's GPT-5, Whisper, and TTS
"""

import speech_recognition as sr
from openai import OpenAI
import io
import tempfile
import os
from playsound import playsound
from dotenv import load_dotenv

# Load API key from .env
load_dotenv()
client = OpenAI()

# System prompt for our assistant's personality
SYSTEM_PROMPT = """You are Jarvis, a helpful personal AI assistant.
You're having a voice conversation, so keep responses concise and natural.
Be knowledgeable, friendly, and occasionally witty—but never robotic."""

conversation_history = []


def listen():
    """Listen for speech and return transcribed text."""
    recognizer = sr.Recognizer()
    
    with sr.Microphone() as source:
        print("\n🎤 Listening...")
        recognizer.adjust_for_ambient_noise(source, duration=0.3)
        
        try:
            audio = recognizer.listen(source, timeout=5, phrase_time_limit=30)
            print("🔄 Processing...")
            
            # Use Whisper for transcription
            wav_data = audio.get_wav_data()
            audio_file = io.BytesIO(wav_data)
            audio_file.name = "audio.wav"
            
            transcript = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                response_format="text"
            )
            
            return transcript.strip()
            
        except sr.WaitTimeoutError:
            return None
        except Exception as e:
            print(f"Listen error: {e}")
            return None


def think(user_input):
    """Process input and generate response using GPT-5."""
    global conversation_history
    
    conversation_history.append({"role": "user", "content": user_input})
    
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(conversation_history[-10:])
    
    response = client.chat.completions.create(
        model="gpt-5",
        messages=messages,
        max_tokens=300,
        temperature=0.7
    )
    
    assistant_message = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": assistant_message})
    
    return assistant_message


def speak(text):
    """Convert text to speech and play it."""
    print(f"🔊 Jarvis: {text}")
    
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=text
    )
    
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
        f.write(response.content)
        temp_path = f.name
    
    playsound(temp_path)
    os.unlink(temp_path)


def main():
    """Main voice assistant loop."""
    print("=" * 50)
    print("   🤖 JARVIS - Personal AI Assistant")
    print("   Say 'goodbye' or 'exit' to quit")
    print("=" * 50)
    
    speak("Hello! I'm Jarvis. How can I help you today?")
    
    while True:
        try:
            # Listen
            user_input = listen()
            
            if not user_input:
                continue
            
            print(f"📝 You: {user_input}")
            
            # Check for exit commands
            if any(word in user_input.lower() for word in ["goodbye", "exit", "quit", "stop"]):
                speak("Goodbye! Have a great day.")
                break
            
            # Think
            response = think(user_input)
            
            # Speak
            speak(response)
            
        except KeyboardInterrupt:
            print("\n\nShutting down...")
            break
        except Exception as e:
            print(f"Error: {e}")
            speak("I encountered an error. Let me try again.")


if __name__ == "__main__":
    main()

Save this as jarvis.py and run it:

python jarvis.py

Speak naturally, and your Jarvis will respond.

Wake Word Detection

Real assistants don’t listen constantly—they wait for wake words like “Hey Siri” or “Alexa.” Here’s a simple implementation:

WAKE_WORDS = ["jarvis", "hey jarvis", "okay jarvis"]

def wait_for_wake_word():
    """Wait for wake word before full listening."""
    print("💤 Waiting for wake word...")
    
    while True:
        text = listen()
        if text:
            text_lower = text.lower()
            for wake_word in WAKE_WORDS:
                if wake_word in text_lower:
                    # Remove wake word from the request
                    command = text_lower.replace(wake_word, "").strip()
                    if command:
                        return command
                    else:
                        speak("Yes?")
                        return listen()

Integrate this into your main loop for a more authentic assistant experience.
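
One way to wire it in is to replace the direct listen() call at the top of the main loop, roughly like this:

# Inside main(): wait for the wake word instead of listening constantly
while True:
    user_input = wait_for_wake_word()  # Blocks until "jarvis ..." is heard
    if not user_input:
        continue

    print(f"📝 You: {user_input}")
    # ...exit check, think(), and speak() continue as before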

Error Handling and Recovery

Production-ready code needs robust error handling:

import time

def robust_listen(max_retries=3):
    """Listen with automatic retry on failure."""
    for attempt in range(max_retries):
        try:
            result = listen()
            if result:
                return result
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(1)
    return None


def robust_think(user_input, max_retries=2):
    """Think with automatic retry on API failure."""
    for attempt in range(max_retries):
        try:
            return think(user_input)
        except Exception as e:
            print(f"API error: {e}")
            if attempt < max_retries - 1:
                time.sleep(2)
    return "I'm having trouble connecting right now. Please try again."

Adding Practical Features

Let’s extend our assistant with useful capabilities.

Weather and Time

Add utility functions:

from datetime import datetime

def get_current_time():
    """Get current time in natural format."""
    now = datetime.now()
    return now.strftime("It's %I:%M %p on %A, %B %d")


def handle_special_commands(text):
    """Check for special commands before sending to GPT."""
    text_lower = text.lower()
    
    if "what time" in text_lower or "current time" in text_lower:
        return get_current_time()
    
    return None  # Let GPT handle it
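
Wire this in just before the GPT call so simple requests are answered locally and never cost an API round trip:

# In the main loop: answer locally when possible, otherwise fall back to GPT-5
special = handle_special_commands(user_input)
if special:
    speak(special)
else:
    speak(think(user_input))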

Web Search Integration

For current information, integrate web search:

def search_web(query):
    """Search the web for current information."""
    # Option 1: Use Perplexity API
    # Option 2: Use Brave Search API  
    # Option 3: Use Tavily
    
    # This enables Jarvis to answer questions about current events
    pass

Plug this into your GPT-5 function calling for real-time information.
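
If you take the function-calling route, you describe the search function to the model and run it whenever the model requests it. A rough sketch using the standard Chat Completions tools format (search_web is the stub above, so you’d fill it in with your chosen search API):

import json

SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web for current information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query"}
            },
            "required": ["query"],
        },
    },
}

def think_with_search(messages):
    """Let the model decide when it needs a web search."""
    response = client.chat.completions.create(
        model="gpt-5",
        messages=messages,
        tools=[SEARCH_TOOL],
    )
    message = response.choices[0].message

    if message.tool_calls:
        # Run the search and hand the results back for a final answer
        call = message.tool_calls[0]
        query = json.loads(call.function.arguments)["query"]
        messages.append(message)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": search_web(query) or "No results found.",
        })
        response = client.chat.completions.create(model="gpt-5", messages=messages)

    return response.choices[0].message.content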

Custom Commands

Create an extensible command system:

import os

def save_note(text):
    """Append the spoken request to a local notes file."""
    with open("notes.txt", "a") as f:
        f.write(text + "\n")

CUSTOM_COMMANDS = {
    "open browser": lambda text: os.system("open -a 'Safari'"),  # macOS; adjust for your OS
    "play music": lambda text: os.system("open -a 'Music'"),
    "take a note": lambda text: save_note(text),
}

def check_custom_command(text):
    """Check if text matches a custom command and run it."""
    text_lower = text.lower()
    for trigger, action in CUSTOM_COMMANDS.items():
        if trigger in text_lower:
            action(text)  # Every action receives the full spoken text
            return True
    return False

Running Locally with Ollama

For complete privacy, replace cloud APIs with local alternatives.

Why Go Local

  • Privacy: Nothing leaves your machine
  • No API costs: Completely free
  • Works offline: Airport, cabin, anywhere
  • No rate limits: Use as much as you want

Swapping OpenAI for Ollama

First, install Ollama and pull a model:

ollama pull llama4

Then modify the think function:

import requests

def think_local(user_input):
    """Process input using local Ollama."""
    global conversation_history
    
    conversation_history.append({"role": "user", "content": user_input})
    
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(conversation_history[-10:])
    
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama4",
            "messages": messages,
            "stream": False
        }
    )
    
    assistant_message = response.json()["message"]["content"]
    conversation_history.append({"role": "assistant", "content": assistant_message})
    
    return assistant_message

For local STT, you can run Whisper locally or use Vosk. For local TTS, pyttsx3 works without internet.
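
For reference, local transcription with the open-source whisper package (pip install openai-whisper, which also needs ffmpeg on your system) might look like this; you’d save the captured audio to a temporary .wav file first:

import whisper

# Load once at startup; "base" is a reasonable speed/accuracy tradeoff on CPU
model = whisper.load_model("base")

def transcribe_local(wav_path):
    """Transcribe an audio file entirely on-device."""
    result = model.transcribe(wav_path)
    return result["text"].strip()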

Check our guide on running AI locally with Ollama for the complete setup.

Performance Optimization

Reducing Latency

The biggest delay is usually TTS. Stream audio for faster response:

def speak_streaming(text):
    """Stream audio for faster playback start."""
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=text,
    )
    
    # Play while downloading (requires different audio library)
    # Implementation depends on your audio library choice
    pass

For production, consider using WebSocket connections for real-time streaming.
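
As one concrete option, recent versions of the OpenAI Python SDK expose a streaming response helper. The sketch below writes the audio to disk chunk by chunk instead of buffering the whole clip in memory; true play-while-downloading still needs an audio library that accepts raw chunks:

def speak_streamed(text, path="reply.mp3"):
    """Stream TTS audio to disk as it downloads, then play it."""
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input=text,
    ) as response:
        response.stream_to_file(path)  # Written incrementally as chunks arrive
    playsound(path)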

Memory Management

Trim conversation history periodically:

def trim_history(max_messages=20):
    """Keep conversation history manageable."""
    global conversation_history
    if len(conversation_history) > max_messages:
        conversation_history = conversation_history[-max_messages:]

This prevents context from growing indefinitely.

Troubleshooting Common Issues

PyAudio Installation Fails: This is the most common issue. On macOS, ensure you’ve installed portaudio via Homebrew first. On Windows, try the prebuilt wheel from Christoph Gohlke’s site. On Linux, install the system portaudio dev package.

Microphone Not Detected: Run the microphone listing code to see available devices. Try specifying the device index explicitly in sr.Microphone(device_index=X).

“API Key Invalid”: Double-check that your .env file exists in the project root and contains the correct key. Make sure python-dotenv is installed and load_dotenv() is called before creating the OpenAI client.

Audio Doesn’t Play: The playsound library can be finicky. Try pip install playsound==1.2.2 (the older version is more stable). Alternatively, use pygame or simpleaudio for playback.

Responses Are Slow: Network latency is usually the culprit. Consider using tts-1 instead of tts-1-hd, reducing max_tokens, or switching to local processing via Ollama.

Frequently Asked Questions

Can I use this completely offline?

Yes, but you’ll need to:

  1. Run Whisper locally (or use Vosk for STT)
  2. Use Ollama with a local model instead of GPT-5
  3. Use pyttsx3 for offline TTS

Quality will be lower, but it works without internet.

What’s the cost to run this?

Using OpenAI APIs with moderate use:

  • Whisper: ~$0.006 per minute of audio
  • GPT-5: ~$0.03 per 1K tokens (varies by response length)
  • TTS: ~$0.015 per 1K characters

For casual personal use, expect $5-15/month. Heavy users might spend more.

How do I add custom commands?

Extend the CUSTOM_COMMANDS dictionary or use GPT function calling for more complex actions. See the custom commands section above.

Is it as good as Alexa or Siri?

In terms of conversation quality? Often better—GPT-5 is more capable than the models powering most commercial assistants. But Alexa and Siri have deeper smart home integrations, more reliable wake word detection, and years of optimization. Our Jarvis excels at knowledge and conversation; commercial assistants excel at home automation.

Can I use a different LLM?

Absolutely. The code works with any OpenAI-compatible API:

  • Anthropic’s Claude via their API
  • Local models via Ollama (Llama 4, Mistral, etc.)
  • Other providers that offer OpenAI-compatible endpoints

Just change the model name and potentially the API endpoint.
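
For example, pointing the existing OpenAI client at Ollama’s OpenAI-compatible endpoint is just two constructor arguments (the base_url below is Ollama’s default local address):

from openai import OpenAI

# Same client, different backend: Ollama serves an OpenAI-compatible API locally
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama4",  # Whatever model you've pulled with `ollama pull`
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)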

Conclusion

We built a genuinely functional AI voice assistant in about 200 lines of Python. It can:

  • Listen to natural speech
  • Understand complex questions and conversation
  • Respond with natural-sounding voice
  • Remember context across a conversation

This isn’t a toy demo—it’s a working system you can use daily. I’ve been using my version for quick questions, brainstorming, and even as a coding companion (I disable TTS and just read responses in the terminal for that).

The technology that seemed like science fiction when we watched Iron Man is now accessible to anyone who can install Python and get an API key.

Where to go from here: add wake word detection, wire up custom commands, plug in web search for current events, or go fully local with Ollama for privacy.

Your move. Build something that talks back.

Last updated: January 11, 2026



Vibe Coder

AI Engineer & Technical Writer
5+ years experience

AI Engineer with 5+ years of experience building production AI systems. Specialized in AI agents, LLMs, and developer tools. Previously built AI solutions processing millions of requests daily. Passionate about making AI accessible to every developer.

AI Agents · LLMs · Prompt Engineering · Python · TypeScript