VibeTDD Experiment 2.1: The Test-After Trap - When AI 'Covers' Existing Code
This is Phase 2.1 of my VibeTDD series - an unplanned experiment that emerged from a common claim I keep hearing in the AI development community.
The Popular Myth
Before moving to Phase 3 where I'd take control of the TDD process, I kept encountering this argument:
"Why do TDD when AI can generate both code AND tests? We write the logic first, then ask AI to create comprehensive test coverage. It works great!"
I've heard this from multiple developers who swear by the approach. They claim AI generates thorough tests that catch bugs and provide good coverage. But something felt off about this.
My doubt: Are we sure AI covers everything properly, or does it just adapt tests to whatever code exists? How do you validate that generated tests are actually testing the right things? What happens when you need to modify the logic?
Time for an experiment.
The Setup: Testing the Test-After Approach
I decided to implement the enhanced payout service from Phase 2 using the "code-first, tests-after" methodology that many developers advocate.
The Rules:
- Implement the complete feature first (no tests)
- Ask Claude to generate comprehensive test coverage
- See what kind of tests AI creates for existing code
- Add a new requirement and observe how it handles changes
The Requirements (same as Phase 2):
- Validate payout data (UserId, Amount, Currency)
- Amount must not exceed 30
- Only EUR, USD, GBP currencies allowed
- Sum of all user payouts must not exceed 100
- Store valid payouts in memory
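Before getting into the implementation, here are the domain types the snippets below rely on. Their exact shapes aren't shown in this post, so treat this as an assumption inferred from how they're used:

// Assumed shapes of the domain types, inferred from usage in the snippets below
enum class Currency { EUR, USD, GBP /* possibly others */ }

data class Payout(val userId: String, val amount: BigDecimal, val currency: Currency)

sealed class PayoutError {
    object MissingUserId : PayoutError()
    object MissingAmount : PayoutError()
    object MissingCurrency : PayoutError()
    object InvalidAmount : PayoutError()
    object InvalidCurrency : PayoutError()
    data class UserLimitExceeded(val currentTotal: BigDecimal, val limit: BigDecimal) : PayoutError()
}

sealed class PayoutResult {
    data class Success(val payout: Payout) : PayoutResult()
    data class Failure(val error: PayoutError) : PayoutResult()
}

class PayoutValidationException(val error: PayoutError) : RuntimeException()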
What Claude Built: The Monolithic Approach
The "Complete" Implementation
Claude confidently created this service:
class PayoutService(private val storage: PayoutStorage) {

    companion object {
        private val MAX_PAYOUT_AMOUNT = BigDecimal("30")
        private val MAX_USER_TOTAL = BigDecimal("100")
        private val ALLOWED_CURRENCIES = setOf(Currency.EUR, Currency.USD, Currency.GBP)
    }

    fun processPayout(userId: String?, amount: BigDecimal?, currency: Currency?): PayoutResult {
        return try {
            val validatedPayout = validateAndCreatePayout(userId, amount, currency)
            storage.store(validatedPayout)
            PayoutResult.Success(validatedPayout)
        } catch (e: PayoutValidationException) {
            PayoutResult.Failure(e.error)
        }
    }

    private fun validateAndCreatePayout(userId: String?, amount: BigDecimal?, currency: Currency?): Payout {
        // Validate required fields
        if (userId.isNullOrBlank()) {
            throw PayoutValidationException(PayoutError.MissingUserId)
        }
        if (amount == null) {
            throw PayoutValidationException(PayoutError.MissingAmount)
        }
        if (currency == null) {
            throw PayoutValidationException(PayoutError.MissingCurrency)
        }

        // Validate amount constraints
        if (amount <= BigDecimal.ZERO || amount > MAX_PAYOUT_AMOUNT) {
            throw PayoutValidationException(PayoutError.InvalidAmount)
        }

        // Validate currency
        if (currency !in ALLOWED_CURRENCIES) {
            throw PayoutValidationException(PayoutError.InvalidCurrency)
        }

        // Check user total limit
        val currentUserTotal = storage.getTotalPayoutByUserId(userId)
        if (currentUserTotal + amount > MAX_USER_TOTAL) {
            throw PayoutValidationException(
                PayoutError.UserLimitExceeded(currentUserTotal, MAX_USER_TOTAL)
            )
        }

        return Payout(userId = userId, amount = amount, currency = currency)
    }
}
Red Flags Immediately Obvious:
- Hardcoded business rules (MAX_PAYOUT_AMOUNT, ALLOWED_CURRENCIES)
- Multiple responsibilities in one method (validation + business logic)
- Impossible to test in isolation - every test needs real storage
- No separation of concerns - adding new validation rules means modifying core logic
But I wanted to see what tests AI would generate for this mess.
The Test Generation Disaster
Problem 1: Configuration Chaos
My first attempt to run the tests failed immediately:
[ERROR] TestEngine with ID 'junit-jupiter' failed to discover tests
[ERROR] There was an error in the forked process
When I provided this error to Claude, its response was shocking:
"This is likely due to missing dependencies or configuration issues. Let me implement tests in Java + add a manual runner so you can run it if tests still won't work using JUnit."
Wait, what? Instead of fixing the Maven configuration, Claude:
- Switched from Kotlin to Java for tests (defeating the purpose)
- Created a manual test runner using main() methods
- Suggested bypassing the testing framework entirely
This immediately revealed a fundamental problem: AI doesn't understand that broken infrastructure needs to be fixed, not worked around.
Problem 2: Shotgun Testing
Once I forced Claude to fix the configuration properly, it generated this test class:
class PayoutServiceTest {

    @Test
    fun `should process valid payout successfully`() { /* basic test */ }

    @Test
    fun `should fail when userId is null`() { /* null test */ }

    @Test
    fun `should fail when amount exceeds 30`() { /* boundary test */ }

    @Test
    fun `should allow all supported currencies`() {
        // Test EUR
        var result = payoutService.processPayout("user1", BigDecimal("10"), Currency.EUR)
        assertTrue(result is PayoutResult.Success)

        // Test USD
        result = payoutService.processPayout("user2", BigDecimal("10"), Currency.USD)
        assertTrue(result is PayoutResult.Success)

        // Test GBP
        result = payoutService.processPayout("user3", BigDecimal("10"), Currency.GBP)
        assertTrue(result is PayoutResult.Success)
    }

    @Test
    fun `should track payouts separately for different users`() { /* ... */ }

    @Test
    fun `should fail when user total would exceed 100`() { /* ... */ }

    // ... 15 more similar tests
}
Problems with this approach:
- Shotgun testing: One massive test class trying to cover everything
- Inefficient coverage: Tests like "allow all supported currencies" would need massive changes if we added more currencies (a data-driven alternative is sketched after this list)
- No isolation: Every test depends on the monolithic service
- Impossible to maintain: Adding validation rules requires updating dozens of tests
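For contrast, the currency scenario could at least have been data-driven. A minimal sketch, assuming JUnit 5's @ParameterizedTest support is on the classpath and that the method sits in the same test class as above - it softens the copy-paste problem, although the deeper coupling to the monolithic service remains:

@ParameterizedTest
@EnumSource(value = Currency::class, names = ["EUR", "USD", "GBP"])
fun `should accept every supported currency`(currency: Currency) {
    // One logical test with one row per currency, instead of three copy-pasted blocks
    val result = payoutService.processPayout("user-${currency.name}", BigDecimal("10"), currency)
    assertTrue(result is PayoutResult.Success)
}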
Problem 3: The False Confidence
The most dangerous part was Claude's confidence:
"These tests provide comprehensive coverage of all validation scenarios and edge cases. The test suite ensures the service behaves correctly across all supported operations."
But when I looked closer:
- Tests were testing the implementation, not behavior
- No separation between different types of validation
- Impossible to test individual business rules in isolation
- Changes to any validation rule would break multiple tests
The Change Request: Adding Currency Restrictions
Now came the real test. I added a new requirement:
"Restrict specific users to use only certain currencies (e.g., User A can only use EUR)"
Claude's "Solution"
As expected, Claude made changes throughout the existing codebase:
Updated Service (now even messier):
class PayoutService(
    private val storage: PayoutStorage,
    private val currencyRestrictions: CurrencyRestrictions? = null
) {

    private fun validateAndCreatePayout(userId: String?, amount: BigDecimal?, currency: Currency?): Payout {
        // ... existing validation logic ...

        // NEW: Validate user-specific currency restrictions
        currencyRestrictions?.let { restrictions ->
            if (!restrictions.isCurrencyAllowed(userId, currency)) {
                val allowedCurrencies = restrictions.getAllowedCurrencies(userId) ?: ALLOWED_CURRENCIES
                throw PayoutValidationException(
                    PayoutError.CurrencyNotAllowedForUser(userId, currency, allowedCurrencies)
                )
            }
        }

        // ... rest of validation ...
    }
}
The Problems Multiplied:
- Even more responsibilities in the same method
- Optional dependencies making testing complex
- Validation order matters but isn't explicit
- Configuration scattered across multiple places
The Test Impact Explosion
Adding this single feature required changes to:
- 8 existing test methods (had to mock new dependency)
- 12 new test methods for currency restrictions
- Complex test setup with multiple mocks
- Parameterized tests that became unwieldy
Example of the resulting test complexity:
@ExtendWith(MockKExtension::class)
class PayoutServiceTest {

    @InjectMockKs
    private lateinit var payoutService: PayoutService

    @MockK
    private lateinit var storage: PayoutStorage

    @MockK
    private lateinit var currencyRestrictions: CurrencyRestrictions

    @Test
    fun `should reject payout when currency is not in user's allowed list`() {
        // Given
        every { currencyRestrictions.isCurrencyAllowed("user123", Currency.USD) } returns false
        every { currencyRestrictions.getAllowedCurrencies("user123") } returns setOf(Currency.EUR)

        // When
        val result = payoutService.processPayout("user123", BigDecimal("10"), Currency.USD)

        // Then
        assertTrue(result is PayoutResult.Failure)
        val error = (result as PayoutResult.Failure).error
        assertTrue(error is PayoutError.CurrencyNotAllowedForUser)
    }

    // ... 30 more tests, each with complex mock setup
}
The Damning Discoveries
Discovery 1: AI Doesn't Test Behavior, It Tests Implementation
The generated tests were tightly coupled to the implementation details. They tested:
- How validation was implemented
- What order validations ran in
- Which exceptions were thrown where
Instead of testing:
- What business rules should be enforced
- When those rules should apply
- Why certain inputs should be valid/invalid
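To make the distinction concrete, here is a hypothetical pair of tests against the same monolithic service. The first one silently encodes the order in which the service happens to check fields; the second one only asserts the business rule itself:

// Implementation-coupled: passes only because userId happens to be validated first,
// so it breaks as soon as the validation order inside the monolith changes
@Test
fun `reports missing userId when all fields are null`() {
    val result = payoutService.processPayout(null, null, null)
    assertEquals(PayoutError.MissingUserId, (result as PayoutResult.Failure).error)
}

// Behaviour-focused: asserts the rule ("payouts above the limit are rejected")
// without caring how, or in what order, the service enforces it
@Test
fun `rejects a payout above the maximum amount`() {
    val result = payoutService.processPayout("user1", BigDecimal("31"), Currency.EUR)
    assertTrue(result is PayoutResult.Failure)
}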
Discovery 2: Test Maintenance Becomes a Nightmare
Every change to business logic required:
- Updating multiple test methods (no clear separation)
- Modifying mock setups across dozens of tests
- Reorganizing test data to match new implementation
- Debugging test failures caused by implementation changes, not requirement changes
Discovery 3: False Coverage Confidence
The test coverage metrics looked great:
- 95% line coverage
- All branches tested
- Comprehensive edge case scenarios
But the tests provided zero confidence for refactoring or changing business rules because they were testing implementation, not behavior.
Discovery 4: AI Creates Tests That Look Right
This was the most insidious problem. The generated tests looked professional:
- Good naming conventions
- Proper test structure
- Comprehensive scenarios
- Clean assertions
But they were fundamentally flawed from an architecture perspective.
The Comparison: What TDD Would Have Produced
If I had followed proper TDD (applying the conventions from my Phase 2 learnings), I would have ended up with:
Separate validators:
interface PayoutValidator {
    fun validate(payout: Payout)
}

class AmountValidator(private val config: PayoutConfiguration) : PayoutValidator
class CurrencyValidator(private val config: PayoutConfiguration) : PayoutValidator
class UserLimitValidator(private val storage: PayoutStorage, private val config: PayoutConfiguration) : PayoutValidator
class CurrencyRestrictionValidator(private val restrictions: CurrencyRestrictions) : PayoutValidator
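A minimal sketch of what one of these validators could look like - ValidationException and the AMOUNT_EXCEEDED error code are assumptions, following the Phase 2 conventions:

class AmountValidator(private val config: PayoutConfiguration) : PayoutValidator {

    override fun validate(payout: Payout) {
        // The limit comes from configuration instead of a hardcoded constant
        val max = BigDecimal.valueOf(config.getMaxAmount())
        if (payout.amount <= BigDecimal.ZERO || payout.amount > max) {
            throw ValidationException(AMOUNT_EXCEEDED)
        }
    }
}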
Clean service orchestration:
class PayoutService(
    private val storage: PayoutStorage,
    private val validators: List<PayoutValidator>
) {
    fun process(payout: Payout) {
        validators.forEach { it.validate(payout) }
        storage.store(payout)
    }
}
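Wiring it together is then plain composition. A sketch, where InMemoryPayoutStorage and DefaultPayoutConfiguration are assumed concrete implementations:

val config = DefaultPayoutConfiguration()   // assumed configuration implementation
val storage = InMemoryPayoutStorage()       // assumed in-memory storage
val payoutService = PayoutService(
    storage = storage,
    validators = listOf(
        AmountValidator(config),
        CurrencyValidator(config),
        UserLimitValidator(storage, config),
    ),
)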
Focused, maintainable tests:
class AmountValidatorTest {

    // Setup (assumed: MockK stub for the configuration, per the conventions above)
    private val config = mockk<PayoutConfiguration>()
    private val validator = AmountValidator(config)

    @Test
    fun `should throw exception when amount exceeds configured limit`() {
        every { config.getMaxAmount() } returns 30.0
        val payout = PayoutMother.of(amount = 35.0)

        val exception = shouldThrow<ValidationException> {
            validator.validate(payout)
        }

        exception.code shouldBe AMOUNT_EXCEEDED
    }
}
Adding currency restrictions would have required (see the sketch after this list):
- One new validator class
- One new test class
- Zero changes to existing code
- Zero changes to existing tests
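Here is roughly what that would look like - a sketch, again assuming the ValidationException and error-code conventions from Phase 2, with CURRENCY_NOT_ALLOWED_FOR_USER as an illustrative code name:

class CurrencyRestrictionValidator(
    private val restrictions: CurrencyRestrictions
) : PayoutValidator {

    override fun validate(payout: Payout) {
        if (!restrictions.isCurrencyAllowed(payout.userId, payout.currency)) {
            throw ValidationException(CURRENCY_NOT_ALLOWED_FOR_USER)
        }
    }
}

class CurrencyRestrictionValidatorTest {

    private val restrictions = mockk<CurrencyRestrictions>()
    private val validator = CurrencyRestrictionValidator(restrictions)

    @Test
    fun `should reject a currency outside the user's allowed list`() {
        every { restrictions.isCurrencyAllowed("user123", Currency.USD) } returns false
        val payout = PayoutMother.of(userId = "user123", currency = Currency.USD)

        val exception = shouldThrow<ValidationException> {
            validator.validate(payout)
        }

        exception.code shouldBe CURRENCY_NOT_ALLOWED_FOR_USER
    }
}

Registering it would be one extra entry in the validators list passed to PayoutService - no existing class or test has to change.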
The Verdict: Test-After is an Anti-Pattern
The "generate tests for existing code" approach is fundamentally flawed because:
❌ It Encourages Poor Design
- Code written without tests tends toward monolithic structures
- No pressure to create testable, modular components
- Business logic gets mixed with infrastructure concerns
❌ Tests Become Implementation-Dependent
- Generated tests lock in current implementation
- Refactoring becomes impossible without rewriting tests
- Changes cascade through multiple test methods
❌ False Confidence in Coverage
- High coverage metrics don't mean good tests
- Tests pass but don't prevent regressions
- Missing edge cases aren't obvious
❌ Maintenance Nightmare
- Every feature addition requires updating multiple tests
- Test failures don't indicate requirement violations
- Debugging test issues becomes as complex as debugging production code
❌ AI Amplifies Anti-Patterns
- AI creates tests that look comprehensive but aren't
- No architectural pressure to write better code
- Quick feedback loop creates false sense of quality
Key Insights for VibeTDD
This experiment reinforced why test-first is crucial when working with AI:
- Tests as Design Pressure: Writing tests first forces you to think about interfaces and separation of concerns
- Behavior Over Implementation: TDD focuses on what the code should do, not how it does it
- Incremental Validation: Each test validates one specific behavior in isolation
- Refactoring Safety: Well-designed tests enable confident refactoring
- AI Needs Constraints: Without test-driven constraints, AI defaults to expedient but unmaintainable solutions
The Pattern Recognition
I'm starting to see a clear pattern across all VibeTDD experiments:
- Phase 1 (Calculator): Simple problem → AI TDD works well
- Phase 2 (Complex TDD): Complex problem → AI TDD breaks down
- Phase 2.1 (Test-After): Any complexity + test-after → Disaster
The conclusion is becoming clear: AI needs the discipline that TDD provides, but can't provide that discipline itself.
Next: Taking Control
Phase 2.1 confirmed my suspicions about the test-after approach. It's time for Phase 3: Human-led TDD with AI as implementation assistant.
The hypothesis: If I provide the architectural discipline through test-first design, can AI serve as an effective code generation tool while maintaining quality?
Let's find out if the test-first approach can harness AI's speed while avoiding the architectural disasters I've witnessed so far.
This experiment was eye-opening about how dangerous the "AI generates tests for existing code" approach really is. The code looks good, the tests pass, but the foundation is rotten. Next up: testing whether human-led TDD can keep AI on the right path. Follow the VibeTDD roadmap for the complete journey.
Code Repository
The complete code from this experiment is available at: VibeTDD Phase 2.1 Repository