How to Integrate LLMs into Your Python Project Without Breaking Existing Code
Large language models (LLMs) are increasingly part of modern development workflows, but adding them to an existing Python project introduces real challenges. You need to maintain backwards compatibility, manage API costs, handle failures gracefully, and avoid the temptation to replace working code with untested LLM outputs.
This guide walks you through a production-ready approach to LLM integration that follows the principle of incremental adoption rather than wholesale replacement.
Understanding the LLM Integration Problem
Many developers approach LLM integration like a silver bullet—expecting it to solve productivity problems across their entire codebase. As Fred Brooks noted in "No Silver Bullet," no single technology solves the fundamental complexities of software development. The same applies to LLMs.
When integrating LLMs into existing Python projects, you're solving a specific problem: which parts of your codebase benefit from language model capabilities, and which parts should remain untouched?
Common integration pitfalls include:
- Replacing stable, tested code with LLM-generated alternatives
- Creating hard dependencies on external APIs without fallbacks
- Ignoring token costs and API rate limits
- Mixing LLM logic throughout your codebase instead of isolating it
- Skipping validation of LLM outputs before using them
Strategy 1: Isolation Through Abstraction
Start by creating an abstraction layer that isolates LLM concerns from your core application logic.
```python
# llm_service.py
from abc import ABC, abstractmethod
from typing import Optional

import anthropic


class CodeGenerationService(ABC):
    """Abstract base for code generation—allows swapping implementations."""

    @abstractmethod
    def generate_docstring(self, code: str) -> Optional[str]:
        """Generate docstring for a function. Returns None if LLM unavailable."""
        pass

    @abstractmethod
    def is_available(self) -> bool:
        """Check if the service is operational."""
        pass


class AnthropicDocstringGenerator(CodeGenerationService):
    """Claude-based docstring generation with fallback behavior."""

    def __init__(self, api_key: str, model: str = "claude-3-5-sonnet-20241022"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model
        self._available = True

    def is_available(self) -> bool:
        return self._available

    def generate_docstring(self, code: str) -> Optional[str]:
        if not self.is_available():
            return None
        try:
            message = self.client.messages.create(
                model=self.model,
                max_tokens=500,
                messages=[
                    {
                        "role": "user",
                        "content": f"Write a concise docstring for this function:\n\n{code}",
                    }
                ],
            )
            return message.content[0].text
        except Exception as e:
            print(f"LLM service error: {e}")
            self._available = False
            return None


class NoOpDocstringGenerator(CodeGenerationService):
    """Fallback implementation when LLM is unavailable."""

    def is_available(self) -> bool:
        return False

    def generate_docstring(self, code: str) -> Optional[str]:
        return None
```
This abstraction lets you:
- Enable/disable LLM features without refactoring
- Test your code without API calls
- Swap LLM providers easily
- Provide fallback behavior
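A small factory makes the swap concrete. This is a standalone sketch: `make_docstring_service` and `CannedDocstringGenerator` are names introduced here for illustration (in a real project the second branch would construct `AnthropicDocstringGenerator`), and minimal stand-in classes are declared inline so the snippet runs on its own.

```python
from typing import Optional


# Stand-ins mirroring the article's interface so this sketch is self-contained.
class NoOpDocstringGenerator:
    def is_available(self) -> bool:
        return False

    def generate_docstring(self, code: str) -> Optional[str]:
        return None


class CannedDocstringGenerator:
    """Placeholder for a real provider such as AnthropicDocstringGenerator."""

    def is_available(self) -> bool:
        return True

    def generate_docstring(self, code: str) -> Optional[str]:
        return '"""TODO: real LLM output would go here."""'


def make_docstring_service(api_key: Optional[str]):
    """Callers get *some* service; which implementation is a config detail."""
    if api_key:
        return CannedDocstringGenerator()
    return NoOpDocstringGenerator()


service = make_docstring_service(api_key=None)
print(service.generate_docstring("def f(): pass"))  # None: graceful degradation

service = make_docstring_service(api_key="sk-test")
print(service.is_available())  # True
```

Because callers only see the interface, switching providers (or turning the feature off) is a one-line change in the factory.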
Strategy 2: Opt-In Feature Flags
Never automatically use LLM outputs. Use feature flags to control where LLM integration happens:
```python
# config.py
import os
from dataclasses import dataclass


@dataclass
class LLMConfig:
    enabled: bool = os.getenv("LLM_ENABLED", "false").lower() == "true"
    api_key: str = os.getenv("ANTHROPIC_API_KEY", "")
    model: str = os.getenv("LLM_MODEL", "claude-3-5-sonnet-20241022")
    use_for_docstrings: bool = os.getenv("LLM_USE_DOCSTRINGS", "false").lower() == "true"
    use_for_tests: bool = os.getenv("LLM_USE_TESTS", "false").lower() == "true"
    max_tokens_per_request: int = 1000
```

```python
# your_module.py
from config import LLMConfig
from llm_service import (
    AnthropicDocstringGenerator,
    CodeGenerationService,
    NoOpDocstringGenerator,
)

config = LLMConfig()

if config.enabled and config.api_key:
    llm = AnthropicDocstringGenerator(api_key=config.api_key, model=config.model)
else:
    llm = NoOpDocstringGenerator()


def document_function(func_code: str, llm_service: CodeGenerationService = None) -> str:
    """Add docstring to function, with optional LLM assistance."""
    # An explicitly injected service wins; otherwise honor the feature flag.
    service = llm_service or (llm if config.use_for_docstrings else NoOpDocstringGenerator())
    if service.is_available():
        generated = service.generate_docstring(func_code)
        if generated:
            return generated
    # Fallback when the LLM is disabled or unavailable: a template docstring
    return '"""[TODO: Add docstring]"""'
```
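A quick standalone check of the flag parsing. A trimmed `LLMConfig` is re-declared here so the snippet runs without the project files; note it uses `default_factory` so each instance re-reads the environment, unlike the class-level defaults above, which are evaluated once at import time.

```python
import os
from dataclasses import dataclass, field


# Trimmed re-declaration of the article's LLMConfig for demonstration only.
@dataclass
class LLMConfig:
    enabled: bool = field(
        default_factory=lambda: os.getenv("LLM_ENABLED", "false").lower() == "true"
    )


os.environ["LLM_ENABLED"] = "true"
print(LLMConfig().enabled)  # True

os.environ["LLM_ENABLED"] = "no"
print(LLMConfig().enabled)  # False: anything but "true" disables the feature
```

The import-time evaluation in the original is fine for long-running processes, but the `default_factory` variant is easier to exercise in tests that flip environment variables.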
Strategy 3: Validation and Cost Controls
Before using LLM output in production, validate it:
```python
import hashlib


class LLMOutputValidator:
    """Validates and logs LLM outputs to prevent bad code."""

    def __init__(self, max_tokens: int = 1000):
        self.max_tokens = max_tokens
        self.call_log = []

    def validate_code_output(self, llm_output: str, original_code: str) -> bool:
        """Basic validation before using generated code."""
        # Check length (word count is a rough proxy for tokens)
        if len(llm_output.split()) > self.max_tokens:
            return False
        # Ensure it's not just copying input
        if llm_output.strip() == original_code.strip():
            return False
        # Check for obviously broken syntax
        try:
            compile(llm_output, '<string>', 'exec')
        except SyntaxError:
            return False
        return True

    def log_call(self, prompt: str, output: str, tokens_used: int, valid: bool):
        """Log LLM calls for auditing and cost analysis."""
        self.call_log.append({
            'prompt_hash': hashlib.sha256(prompt.encode()).hexdigest(),
            'output_length': len(output),
            'tokens_used': tokens_used,
            'valid': valid,
        })
```
Integration Timeline: The Right Order
Integrate LLMs incrementally in this order:
| Phase | Task | Risk Level | Reversibility |
|-------|------|------------|---------------|
| 1 | Create abstraction layer | Very Low | High |
| 2 | Add feature flags | Very Low | High |
| 3 | Use LLMs for code documentation only | Low | High |
| 4 | Use LLMs for test generation (with review) | Medium | Medium |
| 5 | Use LLMs for code suggestions in CI/CD | Medium | Medium |
| 6 | Consider production code generation | High | Low |
Common Mistakes to Avoid
Mistake 1: Removing human review

Always require human review before LLM-generated code ships to production. Use it as assistance, not replacement.
Mistake 2: Ignoring API costs

Track tokens used across your organization. Set hard limits:
```python
class CostAwareLLMService(AnthropicDocstringGenerator):
    """Extends the concrete generator, so the abstract methods stay implemented."""

    def __init__(self, api_key: str, monthly_budget_dollars: float = 100):
        super().__init__(api_key=api_key)
        self.monthly_budget = monthly_budget_dollars
        self.tokens_used_this_month = 0

    def remaining_budget(self) -> float:
        # Rough estimate: $0.003 per 1K tokens (adjust for your model)
        cost_so_far = (self.tokens_used_this_month / 1000) * 0.003
        return self.monthly_budget - cost_so_far

    def can_make_request(self) -> bool:
        return self.remaining_budget() > 0.01  # Stop at 1 cent left
```
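The budget arithmetic is easy to verify in isolation. The sketch below extracts just the cost logic into a `BudgetGuard` class (a name introduced here, with no API client attached); the $0.003-per-1K-token rate and the budget numbers are illustrative.

```python
class BudgetGuard:
    """Minimal standalone version of the cost logic above."""

    def __init__(self, monthly_budget_dollars: float = 100.0,
                 dollars_per_1k_tokens: float = 0.003):
        self.monthly_budget = monthly_budget_dollars
        self.rate = dollars_per_1k_tokens
        self.tokens_used_this_month = 0

    def remaining_budget(self) -> float:
        spent = (self.tokens_used_this_month / 1000) * self.rate
        return self.monthly_budget - spent

    def can_make_request(self) -> bool:
        return self.remaining_budget() > 0.01  # stop at 1 cent left


guard = BudgetGuard(monthly_budget_dollars=1.0)
guard.tokens_used_this_month = 300_000   # 300K tokens ≈ $0.90 at this rate
print(round(guard.remaining_budget(), 2))  # 0.1
print(guard.can_make_request())            # True

guard.tokens_used_this_month = 400_000   # ≈ $1.20: over budget
print(guard.can_make_request())            # False
```

In production you would persist `tokens_used_this_month` (and reset it monthly) rather than keep it in memory, and read actual token counts from the API response.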
Mistake 3: Mixing concerns

Keep LLM logic separate from business logic. Use dependency injection:
```python
from typing import Optional


def process_order(order_id: str, llm_service: Optional[CodeGenerationService] = None) -> None:
    """Process order. LLM enhancement is optional."""
    order = fetch_order(order_id)

    # Core logic unchanged
    validate_order(order)
    charge_card(order)

    # Optional enhancement (generate_summary would be an extension of the interface)
    if llm_service and llm_service.is_available():
        summary = llm_service.generate_summary(order)
```
Testing Without External APIs
```python
import unittest
from unittest.mock import Mock

from llm_service import CodeGenerationService
from your_module import document_function


class TestLLMIntegration(unittest.TestCase):
    def test_document_function_with_llm_unavailable(self):
        """Verify fallback behavior when LLM is down."""
        mock_llm = Mock(spec=CodeGenerationService)
        mock_llm.is_available.return_value = False
        result = document_function("def foo(): pass", llm_service=mock_llm)
        self.assertIn("TODO", result)

    def test_document_function_with_llm_available(self):
        """Verify LLM output is used when available."""
        mock_llm = Mock(spec=CodeGenerationService)
        mock_llm.is_available.return_value = True
        mock_llm.generate_docstring.return_value = '"""Does foo."""'
        result = document_function("def foo(): pass", llm_service=mock_llm)
        self.assertEqual(result, '"""Does foo."""')


if __name__ == "__main__":
    unittest.main()
```
Monitoring and Metrics
Track key metrics to understand LLM impact:
```python
from dataclasses import dataclass


@dataclass
class LLMMetrics:
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    avg_response_time_ms: float = 0.0
    tokens_used: int = 0
    human_approvals: int = 0

    def approval_rate(self) -> float:
        if self.successful_requests == 0:
            return 0.0
        return (self.human_approvals / self.successful_requests) * 100
```
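One way to feed the dataclass is a small recording helper; `record_request` is a function introduced here, which keeps `avg_response_time_ms` as an incremental mean so you never need to store individual timings.

```python
from dataclasses import dataclass


@dataclass
class LLMMetrics:
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    avg_response_time_ms: float = 0.0
    tokens_used: int = 0
    human_approvals: int = 0


def record_request(m: LLMMetrics, ok: bool, ms: float, tokens: int) -> None:
    """Update counters; avg_response_time_ms becomes a running mean."""
    m.total_requests += 1
    m.successful_requests += ok       # bool counts as 0 or 1
    m.failed_requests += (not ok)
    m.tokens_used += tokens
    # incremental mean: new_avg = old_avg + (x - old_avg) / n
    m.avg_response_time_ms += (ms - m.avg_response_time_ms) / m.total_requests


m = LLMMetrics()
record_request(m, ok=True, ms=200.0, tokens=120)
record_request(m, ok=False, ms=400.0, tokens=0)
print(m.total_requests, m.successful_requests, m.failed_requests)  # 2 1 1
print(m.avg_response_time_ms)  # 300.0
```

Wrap the actual API call with `time.perf_counter()` to get `ms`, and read token usage from the provider's response metadata.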
Use these metrics to decide:
- Should you increase/decrease LLM usage?
- Is the cost justified by productivity gains?
- Are outputs genuinely useful or just creating review overhead?
Conclusion
LLM integration isn't a binary decision. Treat it as a gradual, measured addition to your development workflow. Start small with well-isolated features (documentation, test scaffolding), maintain strict cost controls, require human review, and measure actual impact on productivity.
The principle of "no silver bullet" applies: LLMs solve specific problems well, but they're not a replacement for solid engineering practices, testing, and code review.
Recommended Tools
- Anthropic Claude API: build AI-powered applications with Claude
- GitHub: where the world builds software