VibeTDD Experiment 2.1: The Test-After Trap - When AI 'Covers' Existing Code
This is Phase 2.1 of my VibeTDD series - an unplanned experiment that emerged from a common claim I keep hearing in the AI development community.
The Popular Myth
Before moving to Phase 3 where I'd take control of the TDD process, I kept encountering this argument:
"Why do TDD when AI can generate both code AND tests? We write the logic first, then ask AI to create comprehensive test coverage. It works great!"
I've heard this from multiple developers who swear by the approach. They claim AI generates thorough tests that catch bugs and provide good coverage. But something felt off about this.
My doubt: Are we sure AI covers everything properly, or does it just adapt tests to whatever code exists? How do you validate that generated tests are actually testing the right things? What happens when you need to modify the logic?
Time for an experiment.
The Setup: Testing the Test-After Approach
I decided to implement the enhanced payout service from Phase 2 using the "code-first, tests-after" methodology that many developers advocate.
The Rules:
- Implement the complete feature first (no tests)
- Ask Claude to generate comprehensive test coverage
- See what kind of tests AI creates for existing code
- Add a new requirement and observe how it handles changes
The Requirements (same as Phase 2):
- Validate payout data (UserId, Amount, Currency)
- Amount must not exceed 30
- Only EUR, USD, GBP currencies allowed
- Sum of all user payouts must not exceed 100
- Store valid payouts in memory
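Before getting into the implementation, here are the domain types the snippets below rely on. Their exact shapes aren't shown in this post, so treat this as an assumption inferred from how they're used:

// Assumed shapes of the domain types, inferred from usage in the snippets below
enum class Currency { EUR, USD, GBP /* possibly others */ }

data class Payout(val userId: String, val amount: BigDecimal, val currency: Currency)

sealed class PayoutError {
    object MissingUserId : PayoutError()
    object MissingAmount : PayoutError()
    object MissingCurrency : PayoutError()
    object InvalidAmount : PayoutError()
    object InvalidCurrency : PayoutError()
    data class UserLimitExceeded(val currentTotal: BigDecimal, val limit: BigDecimal) : PayoutError()
}

sealed class PayoutResult {
    data class Success(val payout: Payout) : PayoutResult()
    data class Failure(val error: PayoutError) : PayoutResult()
}

class PayoutValidationException(val error: PayoutError) : RuntimeException()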
What Claude Built: The Monolithic Approach
The "Complete" Implementation
Claude confidently created this service:
class PayoutService(private val storage: PayoutStorage) {

    companion object {
        private val MAX_PAYOUT_AMOUNT = BigDecimal("30")
        private val MAX_USER_TOTAL = BigDecimal("100")
        private val ALLOWED_CURRENCIES = setOf(Currency.EUR, Currency.USD, Currency.GBP)
    }

    fun processPayout(userId: String?, amount: BigDecimal?, currency: Currency?): PayoutResult {
        return try {
            val validatedPayout = validateAndCreatePayout(userId, amount, currency)
            storage.store(validatedPayout)
            PayoutResult.Success(validatedPayout)
        } catch (e: PayoutValidationException) {
            PayoutResult.Failure(e.error)
        }
    }

    private fun validateAndCreatePayout(userId: String?, amount: BigDecimal?, currency: Currency?): Payout {
        // Validate required fields
        if (userId.isNullOrBlank()) {
            throw PayoutValidationException(PayoutError.MissingUserId)
        }
        if (amount == null) {
            throw PayoutValidationException(PayoutError.MissingAmount)
        }
        if (currency == null) {
            throw PayoutValidationException(PayoutError.MissingCurrency)
        }

        // Validate amount constraints
        if (amount <= BigDecimal.ZERO || amount > MAX_PAYOUT_AMOUNT) {
            throw PayoutValidationException(PayoutError.InvalidAmount)
        }

        // Validate currency
        if (currency !in ALLOWED_CURRENCIES) {
            throw PayoutValidationException(PayoutError.InvalidCurrency)
        }

        // Check user total limit
        val currentUserTotal = storage.getTotalPayoutByUserId(userId)
        if (currentUserTotal + amount > MAX_USER_TOTAL) {
            throw PayoutValidationException(
                PayoutError.UserLimitExceeded(currentUserTotal, MAX_USER_TOTAL)
            )
        }

        return Payout(userId = userId, amount = amount, currency = currency)
    }
}
Red Flags Immediately Obvious:
- Hardcoded business rules (MAX_PAYOUT_AMOUNT, ALLOWED_CURRENCIES)
- Multiple responsibilities in one method (validation + business logic)
- Impossible to test in isolation - every test needs real storage
- No separation of concerns - adding new validation rules means modifying core logic
But I wanted to see what tests AI would generate for this mess.
The Test Generation Disaster
Problem 1: Configuration Chaos
My first attempt to run the tests failed immediately:
[ERROR] TestEngine with ID 'junit-jupiter' failed to discover tests
[ERROR] There was an error in the forked process
When I provided this error to Claude, its response was shocking:
"This is likely due to missing dependencies or configuration issues. Let me implement tests in Java + add a manual runner so you can run it if tests still won't work using JUnit."
Wait, what? Instead of fixing the Maven configuration, Claude:
- Switched from Kotlin to Java for tests (defeating the purpose)
- Created a manual test runner using main() methods
- Suggested bypassing the testing framework entirely
This immediately revealed a fundamental problem: AI doesn't understand that broken infrastructure needs to be fixed, not worked around.
Problem 2: Shotgun Testing
Once I forced Claude to fix the configuration properly, it generated this test class:
class PayoutServiceTest {

    @Test
    fun `should process valid payout successfully`() { /* basic test */ }

    @Test
    fun `should fail when userId is null`() { /* null test */ }

    @Test
    fun `should fail when amount exceeds 30`() { /* boundary test */ }

    @Test
    fun `should allow all supported currencies`() {
        // Test EUR
        var result = payoutService.processPayout("user1", BigDecimal("10"), Currency.EUR)
        assertTrue(result is PayoutResult.Success)

        // Test USD
        result = payoutService.processPayout("user2", BigDecimal("10"), Currency.USD)
        assertTrue(result is PayoutResult.Success)

        // Test GBP
        result = payoutService.processPayout("user3", BigDecimal("10"), Currency.GBP)
        assertTrue(result is PayoutResult.Success)
    }

    @Test
    fun `should track payouts separately for different users`() { /* ... */ }

    @Test
    fun `should fail when user total would exceed 100`() { /* ... */ }

    // ... 15 more similar tests
}
Problems with this approach:
- Shotgun testing: One massive test class trying to cover everything
- Inefficient coverage: Tests like "allow all supported currencies" would need massive changes if we added more currencies (a data-driven alternative is sketched after this list)
- No isolation: Every test depends on the monolithic service
- Impossible to maintain: Adding validation rules requires updating dozens of tests
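For contrast, the currency scenario could at least have been data-driven. A minimal sketch, assuming JUnit 5's @ParameterizedTest support is on the classpath and that the method sits in the same test class as above - it softens the copy-paste problem, although the deeper coupling to the monolithic service remains:

@ParameterizedTest
@EnumSource(value = Currency::class, names = ["EUR", "USD", "GBP"])
fun `should accept every supported currency`(currency: Currency) {
    // One logical test with one row per currency, instead of three copy-pasted blocks
    val result = payoutService.processPayout("user-${currency.name}", BigDecimal("10"), currency)
    assertTrue(result is PayoutResult.Success)
}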
Problem 3: The False Confidence
The most dangerous part was Claude's confidence:
"These tests provide comprehensive coverage of all validation scenarios and edge cases. The test suite ensures the service behaves correctly across all supported operations."
But when I looked closer:
- Tests were testing the implementation, not behavior
- No separation between different types of validation
- Impossible to test individual business rules in isolation
- Changes to any validation rule would break multiple tests
The Change Request: Adding Currency Restrictions
Now came the real test. I added a new requirement:
"Restrict specific users to use only certain currencies (e.g., User A can only use EUR)"
Claude's "Solution"
As expected, Claude made changes throughout the existing codebase:
Updated Service (now even messier):
class PayoutService(
    private val storage: PayoutStorage,
    private val currencyRestrictions: CurrencyRestrictions? = null
) {

    private fun validateAndCreatePayout(userId: String?, amount: BigDecimal?, currency: Currency?): Payout {
        // ... existing validation logic ...

        // NEW: Validate user-specific currency restrictions
        currencyRestrictions?.let { restrictions ->
            if (!restrictions.isCurrencyAllowed(userId, currency)) {
                val allowedCurrencies = restrictions.getAllowedCurrencies(userId) ?: ALLOWED_CURRENCIES
                throw PayoutValidationException(
                    PayoutError.CurrencyNotAllowedForUser(userId, currency, allowedCurrencies)
                )
            }
        }

        // ... rest of validation ...
    }
}
The Problems Multiplied:
- Even more responsibilities in the same method
- Optional dependencies making testing complex
- Validation order matters but isn't explicit
- Configuration scattered across multiple places
The Test Impact Explosion
Adding this single feature required changes to:
- 8 existing test methods (had to mock new dependency)
- 12 new test methods for currency restrictions
- Complex test setup with multiple mocks
- Parameterized tests that became unwieldy
Example of the resulting test complexity:
@ExtendWith(MockKExtension::class)
class PayoutServiceTest {

    @InjectMockKs
    private lateinit var payoutService: PayoutService

    @MockK
    private lateinit var storage: PayoutStorage

    @MockK
    private lateinit var currencyRestrictions: CurrencyRestrictions

    @Test
    fun `should reject payout when currency is not in user's allowed list`() {
        // Given
        every { currencyRestrictions.isCurrencyAllowed("user123", Currency.USD) } returns false
        every { currencyRestrictions.getAllowedCurrencies("user123") } returns setOf(Currency.EUR)

        // When
        val result = payoutService.processPayout("user123", BigDecimal("10"), Currency.USD)

        // Then
        assertTrue(result is PayoutResult.Failure)
        val error = (result as PayoutResult.Failure).error
        assertTrue(error is PayoutError.CurrencyNotAllowedForUser)
    }

    // ... 30 more tests, each with complex mock setup
}
The Damning Discoveries
Discovery 1: AI Doesn't Test Behavior, It Tests Implementation
The generated tests were tightly coupled to the implementation details. They tested:
- How validation was implemented
- What order validations ran in
- Which exceptions were thrown where
Instead of testing:
- What business rules should be enforced
- When those rules should apply
- Why certain inputs should be valid/invalid
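To make the distinction concrete, here is a hypothetical pair of tests against the same monolithic service. The first one silently encodes the order in which the service happens to check fields; the second one only asserts the business rule itself:

// Implementation-coupled: passes only because userId happens to be validated first,
// so it breaks as soon as the validation order inside the monolith changes
@Test
fun `reports missing userId when all fields are null`() {
    val result = payoutService.processPayout(null, null, null)
    assertEquals(PayoutError.MissingUserId, (result as PayoutResult.Failure).error)
}

// Behaviour-focused: asserts the rule ("payouts above the limit are rejected")
// without caring how, or in what order, the service enforces it
@Test
fun `rejects a payout above the maximum amount`() {
    val result = payoutService.processPayout("user1", BigDecimal("31"), Currency.EUR)
    assertTrue(result is PayoutResult.Failure)
}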
Discovery 2: Test Maintenance Becomes a Nightmare
Every change to business logic required:
- Updating multiple test methods (no clear separation)
- Modifying mock setups across dozens of tests
- Reorganizing test data to match new implementation
- Debugging test failures caused by implementation changes, not requirement changes
Discovery 3: False Coverage Confidence
The test coverage metrics looked great:
- 95% line coverage
- All branches tested
- Comprehensive edge case scenarios
But the tests provided zero confidence for refactoring or changing business rules because they were testing implementation, not behavior.
Discovery 4: AI Creates Tests That Look Right
This was the most insidious problem. The generated tests looked professional:
- Good naming conventions
- Proper test structure
- Comprehensive scenarios
- Clean assertions
But they were fundamentally flawed from an architecture perspective.
The Comparison: What TDD Would Have Produced
If I had followed proper TDD (applying the conventions from my Phase 2 learnings), I would have ended up with:
Separate validators:
interface PayoutValidator {
    fun validate(payout: Payout)
}

class AmountValidator(private val config: PayoutConfiguration) : PayoutValidator
class CurrencyValidator(private val config: PayoutConfiguration) : PayoutValidator
class UserLimitValidator(private val storage: PayoutStorage, private val config: PayoutConfiguration) : PayoutValidator
class CurrencyRestrictionValidator(private val restrictions: CurrencyRestrictions) : PayoutValidator
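A minimal sketch of what one of these validators could look like - ValidationException and the AMOUNT_EXCEEDED error code are assumptions, following the Phase 2 conventions:

class AmountValidator(private val config: PayoutConfiguration) : PayoutValidator {

    override fun validate(payout: Payout) {
        // The limit comes from configuration instead of a hardcoded constant
        val max = BigDecimal.valueOf(config.getMaxAmount())
        if (payout.amount <= BigDecimal.ZERO || payout.amount > max) {
            throw ValidationException(AMOUNT_EXCEEDED)
        }
    }
}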
Clean service orchestration:
class PayoutService(
    private val storage: PayoutStorage,
    private val validators: List<PayoutValidator>
) {
    fun process(payout: Payout) {
        validators.forEach { it.validate(payout) }
        storage.store(payout)
    }
}
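Wiring it together is then plain composition. A sketch, where InMemoryPayoutStorage and DefaultPayoutConfiguration are assumed concrete implementations:

val config = DefaultPayoutConfiguration()   // assumed configuration implementation
val storage = InMemoryPayoutStorage()       // assumed in-memory storage
val payoutService = PayoutService(
    storage = storage,
    validators = listOf(
        AmountValidator(config),
        CurrencyValidator(config),
        UserLimitValidator(storage, config),
    ),
)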
Focused, maintainable tests:
class AmountValidatorTest {

    // Setup (assumed: MockK stub for the configuration, per the conventions above)
    private val config = mockk<PayoutConfiguration>()
    private val validator = AmountValidator(config)

    @Test
    fun `should throw exception when amount exceeds configured limit`() {
        every { config.getMaxAmount() } returns 30.0
        val payout = PayoutMother.of(amount = 35.0)

        val exception = shouldThrow<ValidationException> {
            validator.validate(payout)
        }

        exception.code shouldBe AMOUNT_EXCEEDED
    }
}
Adding currency restrictions would have required (see the sketch after this list):
- One new validator class
- One new test class
- Zero changes to existing code
- Zero changes to existing tests
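Here is roughly what that would look like - a sketch, again assuming the ValidationException and error-code conventions from Phase 2, with CURRENCY_NOT_ALLOWED_FOR_USER as an illustrative code name:

class CurrencyRestrictionValidator(
    private val restrictions: CurrencyRestrictions
) : PayoutValidator {

    override fun validate(payout: Payout) {
        if (!restrictions.isCurrencyAllowed(payout.userId, payout.currency)) {
            throw ValidationException(CURRENCY_NOT_ALLOWED_FOR_USER)
        }
    }
}

class CurrencyRestrictionValidatorTest {

    private val restrictions = mockk<CurrencyRestrictions>()
    private val validator = CurrencyRestrictionValidator(restrictions)

    @Test
    fun `should reject a currency outside the user's allowed list`() {
        every { restrictions.isCurrencyAllowed("user123", Currency.USD) } returns false
        val payout = PayoutMother.of(userId = "user123", currency = Currency.USD)

        val exception = shouldThrow<ValidationException> {
            validator.validate(payout)
        }

        exception.code shouldBe CURRENCY_NOT_ALLOWED_FOR_USER
    }
}

Registering it would be one extra entry in the validators list passed to PayoutService - no existing class or test has to change.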
The Verdict: Test-After is an Anti-Pattern
The "generate tests for existing code" approach is fundamentally flawed because:
❌ It Encourages Poor Design
- Code written without tests tends toward monolithic structures
- No pressure to create testable, modular components
- Business logic gets mixed with infrastructure concerns
❌ Tests Become Implementation-Dependent
- Generated tests lock in current implementation
- Refactoring becomes impossible without rewriting tests
- Changes cascade through multiple test methods
❌ False Confidence in Coverage
- High coverage metrics don't mean good tests
- Tests pass but don't prevent regressions
- Missing edge cases aren't obvious
❌ Maintenance Nightmare
- Every feature addition requires updating multiple tests
- Test failures don't indicate requirement violations
- Debugging test issues becomes as complex as debugging production code
❌ AI Amplifies Anti-Patterns
- AI creates tests that look comprehensive but aren't
- No architectural pressure to write better code
- Quick feedback loop creates false sense of quality
Key Insights for VibeTDD
This experiment reinforced why test-first is crucial when working with AI:
- Tests as Design Pressure: Writing tests first forces you to think about interfaces and separation of concerns
- Behavior Over Implementation: TDD focuses on what the code should do, not how it does it
- Incremental Validation: Each test validates one specific behavior in isolation
- Refactoring Safety: Well-designed tests enable confident refactoring
- AI Needs Constraints: Without test-driven constraints, AI defaults to expedient but unmaintainable solutions
The Pattern Recognition
I'm starting to see a clear pattern across all VibeTDD experiments:
- Phase 1 (Calculator): Simple problem → AI TDD works well
- Phase 2 (Complex TDD): Complex problem → AI TDD breaks down
- Phase 2.1 (Test-After): Any complexity + test-after → Disaster
The conclusion is becoming clear: AI needs the discipline that TDD provides, but can't provide that discipline itself.
Next: Taking Control
Phase 2.1 confirmed my suspicions about the test-after approach. It's time for Phase 3: Human-led TDD with AI as implementation assistant.
The hypothesis: If I provide the architectural discipline through test-first design, can AI serve as an effective code generation tool while maintaining quality?
Let's find out if the test-first approach can harness AI's speed while avoiding the architectural disasters I've witnessed so far.
This experiment was eye-opening about how dangerous the "AI generates tests for existing code" approach really is. The code looks good, the tests pass, but the foundation is rotten. Next up: testing whether human-led TDD can keep AI on the right path. Follow the VibeTDD roadmap for the complete journey.
Code Repository
The complete code from this experiment is available at: VibeTDD Phase 2.1 Repository