How to Detect AI-Generated Code in Pull Requests and Review It Properly

Parkinson's Law states that work expands to fill available time. In 2025, that time is often filled by generative AI—and the results aren't always what they appear to be.

If you're reviewing pull requests, you've likely encountered code that looks polished but lacks foundational understanding. An LLM can generate a database migration script that runs without errors, but it won't know your schema philosophy. It can produce a REST endpoint that compiles, but might not handle your edge cases. The danger isn't always obvious at first glance.

This guide teaches you how to identify AI-generated artifacts during code review and spot the architectural gaps they hide.

Why AI-Generated Code Is Deceptive

Large language models excel at pattern matching and syntax reproduction. They can produce:

  • Confident-looking documentation
  • Working boilerplate code
  • Complex-looking implementations that compile
  • Explanations using terminology they don't fundamentally understand

The problem emerges in two distinct ways:

  1. Novice-to-Senior Gap: Junior developers using AI to produce code that appears senior-level but lacks the judgment behind it.
  2. Cross-Domain Generation: People building in disciplines where they have no formal training—database architects without schema experience, system designers without infrastructure background.

The second category is riskier. A developer writing unfamiliar code thinks they understand it because the AI's explanation sounded coherent.

Red Flags in AI-Generated Code

1. Stylistic Markers

AI often produces telltale patterns:

from typing import Any, Dict, List

def transform(item: Dict) -> Dict:
    return item  # stand-in so the example runs; the real transform lives elsewhere

# AI-generated: overuse of em-dashes in comments
def process_data(items: List[Dict]) -> Dict[str, Any]:
    """Process data items — handling edge cases with confidence.

    This function ensures all items are valid — and gracefully
    handles missing fields — returning a comprehensive result."""
    result = {}
    for item in items:
        # Process each item — ensuring type safety
        result[item['id']] = transform(item)
    return result

Watch for:

  • Excessive em-dashes instead of commas or parentheses (a crude way to flag this in a diff is sketched after this list)
  • Overly formal, rhythmic comment structure
  • Confident use of technical terms in explanations that betray misunderstanding
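
None of these markers is conclusive on its own, but the em-dash signal is mechanical enough to script. Here's a minimal sketch, assuming unified-diff input; the comment regex and whatever threshold you pick are illustrative, not tuned:

import re

EM_DASH = "\u2014"
COMMENT_RE = re.compile(r'\s*(#|//|--|/\*|\*|""")')

def em_dash_density(diff_text: str) -> float:
    """Em-dashes per added comment/docstring line in a unified diff."""
    added = [line[1:] for line in diff_text.splitlines()
             if line.startswith("+") and not line.startswith("+++")]
    comments = [l for l in added if COMMENT_RE.match(l)]
    if not comments:
        return 0.0
    return sum(l.count(EM_DASH) for l in comments) / len(comments)

# Illustrative usage: anything well above zero is worth a closer look
sample = "+    # Process each item \u2014 ensuring type safety"
print(em_dash_density(sample))  # 1.0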

2. Structural Generalization

AI tends toward "correct enough" patterns rather than domain-specific optimization:

-- AI-generated: generic approach without understanding your scale
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata JSONB DEFAULT '{}' -- Handles "any future fields"
);

CREATE INDEX idx_users_email ON users(email);

The JSONB field is a red flag: it suggests the author didn't know your schema requirements and defaulted to the "flexible" catch-all that LLMs reliably reach for. The index is another tell; in PostgreSQL, the UNIQUE constraint already creates an index on email, so idx_users_email is redundant.

3. Missing Context-Specific Decisions

Ask yourself while reviewing:

  • Does this implementation know about our infrastructure? (Cache strategy, deployment environment, scale assumptions)
  • Did the author explain why this design over alternatives? (If they can't, the AI probably couldn't either)
  • Are edge cases handled or assumed? (AI often handles the happy path perfectly while quietly missing the edge cases; see the sketch below)
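
To make the edge-case point concrete, here's an illustrative Python contrast. PLAN_DISCOUNTS and the logger are hypothetical stand-ins, not code from any real PR:

import logging

logger = logging.getLogger(__name__)
PLAN_DISCOUNTS = {"pro": 0.2, "team": 0.3}  # hypothetical lookup table

# The happy-path version an LLM often produces:
def get_discount(user: dict) -> float:
    return PLAN_DISCOUNTS[user["plan"]]  # KeyError if plan is missing or unknown

# The edge-case-aware version a reviewer should push for:
def get_discount_safe(user: dict) -> float:
    plan = user.get("plan")
    if plan not in PLAN_DISCOUNTS:
        logger.warning("unknown plan %r; defaulting to no discount", plan)
        return 0.0
    return PLAN_DISCOUNTS[plan]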

How to Review AI-Generated Code Effectively

Step 1: Check the Author's Understanding

During PR review, ask direct questions in comments:

Great work on the data pipeline! Quick questions:

1. Why partition by date instead of user_id here?
2. What happens when the lookup table doesn't have a matching record?
3. How does this handle concurrent writes to the cache key?

Would love to understand your reasoning.

If the response is vague, defensive, or copy-pasted-sounding, that's a signal. If they can't articulate the decisions, they probably didn't understand them.

Step 2: Audit Against Domain Standards

Compare to your team's established patterns:

| Aspect | Check For | Red Flag |
|--------|-----------|----------|
| Error Handling | Specific exceptions, logging strategy | Generic try-catch, swallowed errors |
| Concurrency | Locks, atomic operations, race conditions | Missing sync points, assumed single-threaded |
| Performance | Index strategy, N+1 prevention, caching | No indexes, full table scans, unused caches |
| Testing | Unit and integration tests included | Only happy-path assertions |
| Documentation | Explains why, not just what | Generic docstring, copies framework docs |
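
For the error-handling row, the contrast often looks like the sketch below; sync_orders and the logger are hypothetical placeholders:

import logging

logger = logging.getLogger(__name__)

def sync_orders() -> None: ...  # hypothetical operation under review

# Red flag: generic try-catch that swallows everything
try:
    sync_orders()
except Exception:
    pass

# Matches team standards: specific exceptions, logged, re-raised
try:
    sync_orders()
except (TimeoutError, ConnectionError) as exc:
    logger.error("order sync failed: %s", exc)
    raise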

Step 3: Trace Cross-Domain Red Flags

This is the most dangerous category. If the PR touches infrastructure, say a caching layer, but the author's background is frontend:

// Red flag: non-infrastructure expert designing a cache layer
const cache = new Map(); // In-memory cache in a single Node process

app.get('/api/users/:id', async (req, res) => {
    const userId = req.params.id;

    if (!cache.has(userId)) {
        // Cached forever, and only in this process
        cache.set(userId, await fetchUserFromDB(userId));
    }

    res.json(cache.get(userId));
});

Problems:

  • No cache invalidation strategy
  • Memory grows unbounded
  • Doesn't account for distributed deployments
  • Works locally, breaks at scale

Ask: "Does your team have a caching architecture we should follow here?"

Step 4: Test Against System Constraints

Before approval, request testing evidence:

  • Load test results (not just "it works")
  • Concurrent operation scenarios
  • Data validation at boundaries
  • Graceful degradation behavior

If the author hasn't tested these, the AI probably didn't consider them.
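
For the concurrent-operation item, even a thread-pool smoke test beats nothing. This sketch assumes the get_user function from the caching example above; it's a probe, not a substitute for a real load test:

from concurrent.futures import ThreadPoolExecutor

def test_concurrent_reads_are_consistent():
    # Hammer one hot key from many threads and check for torn results.
    with ThreadPoolExecutor(max_workers=32) as pool:
        results = list(pool.map(get_user, ["user-1"] * 500))
    assert all(r == results[0] for r in results)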

Code Example: Proper AI Code Review Template

Use this checklist in your PR review process:

## AI-Aware Code Review Checklist

### Before Approving

- [ ] Author can explain design decisions (not framework docs, their reasoning)
- [ ] Edge cases are explicitly handled, not assumed
- [ ] Matches team's existing patterns for [caching/auth/errors]
- [ ] No obvious cross-domain red flags (is the author working outside their expertise?)
- [ ] Load test results or performance analysis attached
- [ ] Concurrency/race conditions considered
- [ ] Fallback behavior documented

### Questions to Ask

1. "Walk me through what happens if [edge case] occurs."
2. "Why this approach over [standard alternative]?"
3. "How does this perform at [your typical scale]?"
4. "Have you used this pattern in production before?"

### If Uncertain

- Request a pairing session
- Have a domain expert review
- Add a staging deployment test

The Real Cost of Missed AI-Generated Code

A colleague spent two months building a data system without formal architecture training. The code looked impressive—verbose, documented, thorough. It produced gigabytes of output. It shipped quietly.

Six months later, it failed under load. The architectural assumptions were invisible because they were never questioned during review.

Your review process is the last gate. The difference between appearing productive and being productive is whether you catch these gaps before they compound.

Key Takeaways

  1. Stylistic markers matter: Em-dashes, overly formal rhythm, confident misunderstanding—these are detection points.
  2. Ask for reasoning, not just output: If they can't explain the "why," the AI probably couldn't either.
  3. Cross-domain work is highest risk: An AI helping someone solve a known problem is safe. An AI replacing expertise in an unfamiliar domain is dangerous.
  4. Verify at scale: Code that works locally ≠ code that works at your production volume.
  5. Trust your instincts: If something feels overly polished but hand-wavy on details, dig deeper.

Effective code review in 2025 means assuming AI assistance and reviewing accordingly. Don't block progress—just make sure the progress is real.
