How to Integrate LLMs into Your Python Project Without Breaking Existing Code

Large language models (LLMs) are increasingly part of modern development workflows, but adding them to an existing Python project introduces real challenges. You need to maintain backwards compatibility, manage API costs, handle failures gracefully, and avoid the temptation to replace working code with untested LLM outputs.

This guide walks you through a production-ready approach to LLM integration that follows the principle of incremental adoption rather than wholesale replacement.

Understanding the LLM Integration Problem

Many developers treat LLM integration as a silver bullet, expecting it to solve productivity problems across their entire codebase. As Fred Brooks noted in "No Silver Bullet," no single technology solves the fundamental complexities of software development. The same applies to LLMs.

When integrating LLMs into existing Python projects, you're solving a specific problem: which parts of your codebase benefit from language model capabilities, and which parts should remain untouched?

Common integration pitfalls include:

  • Replacing stable, tested code with LLM-generated alternatives
  • Creating hard dependencies on external APIs without fallbacks
  • Ignoring token costs and API rate limits
  • Mixing LLM logic throughout your codebase instead of isolating it
  • Skipping validation of LLM outputs before using them

Strategy 1: Isolation Through Abstraction

Start by creating an abstraction layer that isolates LLM concerns from your core application logic.

# llm_service.py
from abc import ABC, abstractmethod
from typing import Optional
import anthropic

class CodeGenerationService(ABC):
    """Abstract base for code generation—allows swapping implementations."""
    
    @abstractmethod
    def generate_docstring(self, code: str) -> Optional[str]:
        """Generate docstring for a function. Returns None if LLM unavailable."""
        pass
    
    @abstractmethod
    def is_available(self) -> bool:
        """Check if the service is operational."""
        pass


class AnthropicDocstringGenerator(CodeGenerationService):
    """Claude-based docstring generation with fallback behavior."""
    
    def __init__(self, api_key: str, model: str = "claude-3-5-sonnet-20241022"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model
        self._available = True
    
    def is_available(self) -> bool:
        return self._available
    
    def generate_docstring(self, code: str) -> Optional[str]:
        if not self.is_available():
            return None
        
        try:
            message = self.client.messages.create(
                model=self.model,
                max_tokens=500,
                messages=[
                    {
                        "role": "user",
                        "content": f"Write a concise docstring for this function:\n\n{code}"
                    }
                ]
            )
            return message.content[0].text
        except Exception as e:
            # Simple circuit breaker: one failure disables the service for the
            # rest of the process lifetime; production code may add retry/reset logic
            print(f"LLM service error: {e}")
            self._available = False
            return None


class NoOpDocstringGenerator(CodeGenerationService):
    """Fallback implementation when LLM is unavailable."""
    
    def is_available(self) -> bool:
        return False
    
    def generate_docstring(self, code: str) -> Optional[str]:
        return None

This abstraction lets you (see the usage sketch after this list):

  • Enable/disable LLM features without refactoring
  • Test your code without API calls
  • Swap LLM providers easily
  • Provide fallback behavior
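
For example, calling code can be written once against the base class and handed either implementation. A minimal sketch using the classes above (the annotate helper is illustrative, not part of the service interface):

import os

def annotate(code: str, service: CodeGenerationService) -> str:
    """Return a generated docstring, or a placeholder when none is produced."""
    doc = service.generate_docstring(code) if service.is_available() else None
    return doc or '"""[TODO: Add docstring]"""'

# Production wiring and offline/test wiring share the same calling code
live = AnthropicDocstringGenerator(api_key=os.environ.get("ANTHROPIC_API_KEY", ""))
offline = NoOpDocstringGenerator()

print(annotate("def add(a, b): return a + b", offline))  # placeholder, no API call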

Strategy 2: Opt-In Feature Flags

Never automatically use LLM outputs. Use feature flags to control where LLM integration happens:

# config.py
import os
from dataclasses import dataclass

@dataclass
class LLMConfig:
    # Note: these defaults are read from the environment once, when the class
    # is defined at import time, not on every LLMConfig() instantiation
    enabled: bool = os.getenv("LLM_ENABLED", "false").lower() == "true"
    api_key: str = os.getenv("ANTHROPIC_API_KEY", "")
    model: str = os.getenv("LLM_MODEL", "claude-3-5-sonnet-20241022")
    use_for_docstrings: bool = os.getenv("LLM_USE_DOCSTRINGS", "false").lower() == "true"
    use_for_tests: bool = os.getenv("LLM_USE_TESTS", "false").lower() == "true"
    max_tokens_per_request: int = 1000


# your_module.py
from typing import Optional

from config import LLMConfig
from llm_service import (
    AnthropicDocstringGenerator,
    CodeGenerationService,
    NoOpDocstringGenerator,
)

config = LLMConfig()

if config.enabled and config.api_key:
    llm = AnthropicDocstringGenerator(api_key=config.api_key, model=config.model)
else:
    llm = NoOpDocstringGenerator()


def document_function(func_code: str, llm_service: Optional[CodeGenerationService] = None) -> str:
    """Add docstring to function, with optional LLM assistance."""
    # Accept an injected service (handy in tests); otherwise fall back to
    # the module-level service, gated by the feature flag
    service = llm_service
    if service is None and config.use_for_docstrings:
        service = llm
    if service is not None and service.is_available():
        generated = service.generate_docstring(func_code)
        if generated:
            return generated
    
    # Fallback: return a placeholder template
    return '"""[TODO: Add docstring]"""'

Strategy 3: Validation and Cost Controls

Before using LLM output in production, validate it:

import hashlib

class LLMOutputValidator:
    """Validates and logs LLM outputs to prevent bad code."""
    
    def __init__(self, max_tokens: int = 1000):
        self.max_tokens = max_tokens
        self.call_log = []
    
    def validate_code_output(self, llm_output: str, original_code: str) -> bool:
        """Basic validation before using generated code."""
        # Length check: word count is a rough proxy for token count
        if len(llm_output.split()) > self.max_tokens:
            return False
        
        # Ensure it's not just copying input
        if llm_output.strip() == original_code.strip():
            return False
        
        # Check for obviously broken syntax
        try:
            compile(llm_output, '<string>', 'exec')
        except SyntaxError:
            return False
        
        return True
    
    def log_call(self, prompt: str, output: str, tokens_used: int, valid: bool):
        """Log LLM calls for auditing and cost analysis."""
        self.call_log.append({
            'prompt_hash': hashlib.sha256(prompt.encode()).hexdigest(),
            'output_length': len(output),
            'tokens_used': tokens_used,
            'valid': valid
        })
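
Wiring the validator into the generation flow might look like the sketch below. The tokens_used argument is passed as zero because generate_docstring as written doesn't surface usage data; treat that plumbing as something to add in your own service.

import os

validator = LLMOutputValidator(max_tokens=500)
generator = AnthropicDocstringGenerator(api_key=os.environ.get("ANTHROPIC_API_KEY", ""))

source = "def add(a, b):\n    return a + b"
docstring = generator.generate_docstring(source)
if docstring is not None:
    is_valid = validator.validate_code_output(docstring, source)
    # tokens_used=0 is a placeholder; the service above doesn't expose usage
    validator.log_call(prompt=source, output=docstring, tokens_used=0, valid=is_valid)
    if is_valid:
        print(docstring)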

Integration Timeline: The Right Order

Integrate LLMs incrementally in this order:

| Phase | Task | Risk Level | Reversibility |
|-------|------|------------|---------------|
| 1 | Create abstraction layer | Very Low | High |
| 2 | Add feature flags | Very Low | High |
| 3 | Use LLMs for code documentation only | Low | High |
| 4 | Use LLMs for test generation (with review) | Medium | Medium |
| 5 | Use LLMs for code suggestions in CI/CD | Medium | Medium |
| 6 | Consider production code generation | High | Low |

Common Mistakes to Avoid

Mistake 1: Removing human review

Always require human review before LLM-generated code ships to production. Use it as assistance, not replacement.

Mistake 2: Ignoring API costs

Track tokens used across your organization and set hard limits:

class CostAwareLLMService(AnthropicDocstringGenerator):
    """Docstring generator that stops making requests once the budget is spent."""
    
    def __init__(self, api_key: str, monthly_budget_dollars: float = 100):
        super().__init__(api_key=api_key)
        self.monthly_budget = monthly_budget_dollars
        self.tokens_used_this_month = 0
    
    def remaining_budget(self) -> float:
        # Rough estimate: $0.003 per 1K tokens (adjust for your model)
        cost_so_far = (self.tokens_used_this_month / 1000) * 0.003
        return self.monthly_budget - cost_so_far
    
    def can_make_request(self) -> bool:
        return self.remaining_budget() > 0.01  # Stop at 1 cent left
    
    def is_available(self) -> bool:
        # Budget exhaustion makes the service unavailable to callers
        return super().is_available() and self.can_make_request()
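
For the budget check to mean anything, tokens_used_this_month has to be updated after each call. One way is to override generate_docstring inside CostAwareLLMService and read the usage fields the Anthropic SDK reports on each response; this is a sketch, so verify the field names against your SDK version:

    def generate_docstring(self, code: str) -> Optional[str]:
        if not self.is_available():  # also enforces the budget, per is_available above
            return None
        try:
            message = self.client.messages.create(
                model=self.model,
                max_tokens=500,
                messages=[{
                    "role": "user",
                    "content": f"Write a concise docstring for this function:\n\n{code}",
                }],
            )
            # Record actual usage so remaining_budget() stays accurate
            self.tokens_used_this_month += (
                message.usage.input_tokens + message.usage.output_tokens
            )
            return message.content[0].text
        except Exception as e:
            print(f"LLM service error: {e}")
            self._available = False
            return None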

Mistake 3: Mixing concerns

Keep LLM logic separate from business logic. Use dependency injection:

def process_order(order_id: str, llm_service: Optional[CodeGenerationService] = None) -> None:
    """Process order. LLM enhancement is optional."""
    order = fetch_order(order_id)
    # Core logic unchanged
    validate_order(order)
    charge_card(order)
    
    # Optional enhancement; generate_summary here stands in for whatever
    # extension of the service interface your project defines
    if llm_service is not None and llm_service.is_available():
        summary = llm_service.generate_summary(order)

Testing Without External APIs

Because services are injected, every integration path can be tested with mocks and no API calls:

import unittest
from unittest.mock import Mock

from llm_service import CodeGenerationService
from your_module import document_function

class TestLLMIntegration(unittest.TestCase):
    def test_document_function_with_llm_unavailable(self):
        """Verify fallback behavior when LLM is down."""
        mock_llm = Mock(spec=CodeGenerationService)
        mock_llm.is_available.return_value = False
        
        result = document_function("def foo(): pass", llm_service=mock_llm)
        self.assertIn("TODO", result)
    
    def test_document_function_with_llm_available(self):
        """Verify LLM output is used when available."""
        mock_llm = Mock(spec=CodeGenerationService)
        mock_llm.is_available.return_value = True
        mock_llm.generate_docstring.return_value = '"""Does foo."""'
        
        result = document_function("def foo(): pass", llm_service=mock_llm)
        self.assertEqual(result, '"""Does foo."""')

Monitoring and Metrics

Track key metrics to understand LLM impact:

import time
from dataclasses import dataclass

@dataclass
class LLMMetrics:
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    avg_response_time_ms: float = 0
    tokens_used: int = 0
    human_approvals: int = 0
    
    def approval_rate(self) -> float:
        if self.successful_requests == 0:
            return 0
        return (self.human_approvals / self.successful_requests) * 100
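
Recording a single request cycle might look like the sketch below, assuming llm is the module-level service from Strategy 2. The incremental running-average update is one common choice, not the only one:

metrics = LLMMetrics()

start = time.time()
output = llm.generate_docstring("def foo(): pass")
elapsed_ms = (time.time() - start) * 1000

metrics.total_requests += 1
if output is not None:
    metrics.successful_requests += 1
    # Incremental running average over successful requests
    n = metrics.successful_requests
    metrics.avg_response_time_ms += (elapsed_ms - metrics.avg_response_time_ms) / n
else:
    metrics.failed_requests += 1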

Use these metrics to decide:

  • Should you increase/decrease LLM usage?
  • Is the cost justified by productivity gains?
  • Are outputs genuinely useful or just creating review overhead?

Conclusion

LLM integration isn't a binary decision. Treat it as a gradual, measured addition to your development workflow. Start small with well-isolated features (documentation, test scaffolding), maintain strict cost controls, require human review, and measure actual impact on productivity.

The principle of "no silver bullet" applies: LLMs solve specific problems well, but they're not a replacement for solid engineering practices, testing, and code review.

Recommended Tools