
VibeTDD Experiment 2: When AI Leads a Real TDD Challenge

This is Phase 2 of my VibeTDD series. After the calculator experiment showed promise, it was time for a real test.

The Challenge: From Toy to Reality

After Claude successfully guided me through TDD basics with a calculator, I decided to escalate dramatically. No more toy problems - time for a real coding challenge from Portfo.

The Task: Build a payout service with these requirements:

  • Validate payout data (UserId, Amount, Currency)
  • Amount must not exceed 30
  • Only EUR, USD, GBP currencies allowed
  • Sum of all user payouts must not exceed 100
  • Store valid payouts in memory
  • Handle validation errors gracefully

The Rules (almost the same as in Phase 1):

  • Claude leads the entire TDD process
  • I only implement what it tells me to
  • No TDD guidance from me initially
  • When Claude asks what to do next, I say "decide yourself"

But this time, I was watching more carefully for anti-patterns.

What Went Wrong: The Over-Engineering Begins

Problem 1: Test Explosion

Claude started strong with the first test:

kotlin
@Test
fun `should store payout when all data is valid`() {
    val payout = Payout("user123", 25.0, "EUR")
    
    payoutService.processPayout(payout)
    
    // Should not throw exception
}

But then it suggested 15 additional test cases for the same basic functionality:

kotlin
@Test  
fun `should store payout with minimum amount`()

@Test
fun `should store payout with maximum amount`() 

@Test
fun `should store payout with EUR currency`()

@Test
fun `should store payout with USD currency`()

@Test
fun `should store payout with GBP currency`()

// ... and 10 more similar tests

Red Flag #1: AI defaults to exhaustive testing rather than minimal, focused tests. It was testing every permutation instead of using triangulation.
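
For contrast, triangulation would probe the boundaries with a few strategic tests instead of enumerating permutations. A hedged sketch of what that could look like (assertion style borrowed from the Kotest assertions used later in this post):

kotlin
@Test
fun `should store payout at the maximum allowed amount`() {
    // Boundary case: 30 is the documented limit
    payoutService.processPayout(Payout("user123", 30.0, "EUR"))
}

@Test
fun `should reject payout just above the maximum allowed amount`() {
    // One step past the boundary is enough to pin down the rule
    shouldThrow<InvalidPayoutException> {
        payoutService.processPayout(Payout("user123", 30.01, "EUR"))
    }
}

Two or three boundary tests pin down the rule; the extra currency and user permutations add almost no new information.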

Problem 2: The Object Mother Catastrophe

When Claude started creating Payout objects directly in test bodies, I stepped in:

"This will become unmaintainable. Let's use the Object Mother pattern."

Claude's response was... excessive:

kotlin
object PayoutMother {
    // "Valid" variations
    fun withSmallAmount() = Payout("user123", 1.0, "EUR")
    fun withMediumAmount() = Payout("user123", 15.0, "EUR") 
    fun withLargeAmount() = Payout("user123", 29.0, "EUR")
    fun withMaxAmount() = Payout("user123", 30.0, "EUR")
    
    // Currency variations
    fun withEUR() = Payout("user123", 25.0, "EUR")
    fun withUSD() = Payout("user123", 25.0, "USD")
    fun withGBP() = Payout("user123", 25.0, "GBP")
    
    // User variations
    fun withUser1() = Payout("user1", 25.0, "EUR")
    fun withUser2() = Payout("user2", 25.0, "EUR")
    
    // ... 15 more similar methods
}

// And then, the horror:
object InvalidPayoutMother {
    fun withNegativeAmount() = Payout("user123", -5.0, "EUR")
    fun withZeroAmount() = Payout("user123", 0.0, "EUR")
    fun withExcessiveAmount() = Payout("user123", 31.0, "EUR")
    fun withInvalidCurrency() = Payout("user123", 25.0, "JPY")
    fun withEmptyUserId() = Payout("", 25.0, "EUR")
    // ... more invalid variations
}

Red Flag #2: AI treated Object Mother as a factory for every possible test scenario rather than a single, flexible method with valid and random default values.

The correct approach would have been:

kotlin
object PayoutMother {

    fun of(
        userId: String = Rand.string(),
        amount: Double = Rand.amount(),
        currency: String = Rand.currency(),
    ) = Payout(
        userId = userId,
        amount = amount,
        currency = currency,
    )
}
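
A test then overrides only the field it actually cares about. The Rand helper above is shorthand from my own conventions; a minimal sketch of what it could look like (names and value ranges are illustrative):

kotlin
import kotlin.random.Random

// Illustrative random-value helper assumed by PayoutMother above
object Rand {
    fun string(): String = "user-" + Random.nextInt(1000, 10000)
    fun amount(): Double = Random.nextDouble(1.0, 30.0) // within the valid range
    fun currency(): String = listOf("EUR", "USD", "GBP").random()
}

// Usage in a test: state only what the test depends on
val tooLarge = PayoutMother.of(amount = 31.0)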

Problem 3: Classic TDD is Impossible with AI

The traditional red-green-refactor cycle that worked beautifully for the calculator completely fell apart:

What Should Happen (Classic TDD):

  1. Write one failing test
  2. Make it pass with minimal code
  3. Refactor
  4. Repeat

What Actually Happened:

  1. Claude writes 5-10 tests at once
  2. Suggests implementing everything simultaneously
  3. No triangulation or incremental development
  4. Skips the "minimal code" phase entirely

But here's the deeper problem: Classic TDD isn't just ineffective with AI - it's impossible for practical reasons:

  • Context Explosion: Each red-green cycle adds more conversation history
  • Memory Consumption: AI sessions blow up quickly with back-and-forth iterations
  • Time Overhead: The constant switching between test/implementation becomes prohibitively slow
  • Session Limits: You'll hit token limits before completing any meaningful feature

My Solution - The VibeTDD Principle: Instead of writing tests one by one, write small sets of related tests first, then implement them together. This batching approach:

  • Reduces context switching overhead
  • Keeps AI sessions manageable
  • Maintains test-first discipline
  • Allows for better validation of test completeness
kotlin
// Instead of: Write one test → implement → write next test → implement
// Do this: Write a focused set of tests → verify they fail → implement together

@Test
fun `should throw exception when UserId is empty`() { /* ... */ }

@Test  
fun `should throw exception when UserId is null`() { /* ... */ }

@Test
fun `should not throw exception when UserId is valid`() { /* ... */ }

// Then implement UserIdValidator to make all three pass

Problem 4: Too Proactive (And Imprecise Instructions)

Claude kept coding without permission:

"Now let's implement the validation logic..." [proceeds to write 50 lines of code]

I learned to be extremely specific:

❌ "Continue with the next step"
❌ "Implement the validation"
❌ "Proceed"

✅ "Write only the test for empty UserId validation"
✅ "Implement only the UserId validation method"
✅ "Show me the next single test case"

Problem 5: Missing Engineering Fundamentals

Despite leading TDD, Claude missed basic software engineering principles:

No Separation of Concerns:

kotlin
class PayoutService {
    fun processPayout(payout: Payout) {
        // Validation logic mixed with business logic
        if (payout.userId.isEmpty()) throw Exception("...")
        if (payout.amount <= 0) throw Exception("...")
        if (payout.currency !in listOf("EUR", "USD", "GBP")) throw Exception("...")
        
        storage.store(payout) // Business logic
    }
}

No Mocking in Tests:

kotlin
@Test
fun `should validate payout data`() {
    val service = PayoutService(InMemoryStorage()) // Real dependency!
    // ...
}
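
A mocked version keeps the unit under test isolated. A sketch using MockK, the library we ended up adopting later in this post (the constructor shape matches the final design shown below):

kotlin
import io.mockk.mockk
import io.mockk.verify

@Test
fun `should store payout when validation passes`() {
    // Mocked dependency instead of a real InMemoryStorage
    val storage = mockk<PayoutStorage>(relaxed = true)
    val service = PayoutService(storage, validators = emptyList())

    service.process(PayoutMother.of())

    verify { storage.store(any()) }
}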

Hardcoded Business Rules:

kotlin
if (payout.amount > 30.0) // Magic number!
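
The fix is to pull the limit into configuration, which is where the final design (below) ends up. A sketch, with the class shape being my assumption:

kotlin
// Hedged sketch: the limit lives in configuration instead of the code
class PayoutConfiguration(
    private val maxAmount: Double = 30.0
) {
    fun getMaxAmount(): Double = maxAmount
}

// The check then reads:
// if (payout.amount > configuration.getMaxAmount()) ...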

Code That Doesn't Compile:

kotlin
// Claude confidently presented this:
shouldThrow<ValidationException> { // Wrong import
    service.process(invalidPayout)
}

The Moment of Intervention

After watching Claude create an unmaintainable mess while being "sure everything is perfect," I had to step in:

"This violates the Single Responsibility Principle. Let's separate validation into individual validator classes."

Claude's response: "You're absolutely right! I violated the Single Responsibility Principle..."

It knew the principles but didn't apply them without explicit prompting.

What We Built (With Heavy Guidance)

After course-correcting, we ended up with a properly architected solution:

Domain Model

kotlin
data class Payout(
    val userId: String,
    val amount: Double,
    val currency: String
)

Validator Interface

kotlin
interface PayoutValidator {
    fun validate(payout: Payout)
}

Individual Validators

kotlin
class UserIdValidator : PayoutValidator {
    override fun validate(payout: Payout) {
        if (payout.userId.isEmpty()) {
            throw InvalidPayoutException(
                PayoutErrorCode.EMPTY_USER_ID,
                "UserId cannot be empty"
            )
        }
    }
}

class AmountValidator(
    private val configuration: PayoutConfiguration
) : PayoutValidator {
    override fun validate(payout: Payout) {
        if (payout.amount <= 0) {
            throw InvalidPayoutException(
                PayoutErrorCode.INVALID_AMOUNT,
                "Amount must be greater than zero"
            )
        }
        
        val maxAmount = configuration.getMaxAmount()
        if (payout.amount > maxAmount) {
            throw InvalidPayoutException(
                PayoutErrorCode.INVALID_AMOUNT,
                "Amount cannot exceed $maxAmount"
            )
        }
    }
}
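
The remaining rule (a user's total payouts must not exceed 100) needs a validator that consults storage, and we already have the USER_LIMIT_EXCEEDED code for it. A hedged sketch, where the storage query and the configuration accessor are my assumptions:

kotlin
class UserLimitValidator(
    private val storage: PayoutStorage,
    private val configuration: PayoutConfiguration
) : PayoutValidator {
    override fun validate(payout: Payout) {
        // Assumed query: all payouts stored so far for this user
        val alreadyPaid = storage.findByUser(payout.userId).sumOf { it.amount }
        // Assumed accessor for the 100 limit
        val userLimit = configuration.getUserLimit()
        if (alreadyPaid + payout.amount > userLimit) {
            throw InvalidPayoutException(
                PayoutErrorCode.USER_LIMIT_EXCEEDED,
                "Total payouts per user cannot exceed $userLimit"
            )
        }
    }
}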

Service Orchestration

kotlin
class PayoutService(
    private val storage: PayoutStorage,
    private val validators: List<PayoutValidator>
) {
    fun process(payout: Payout) {
        validators.forEach { it.validate(payout) }
        storage.store(payout)
    }
}
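
Wiring it together is then a matter of composing the list. The exact composition root and the CurrencyValidator name are my assumptions; only UserIdValidator and AmountValidator are shown above:

kotlin
val configuration = PayoutConfiguration()
val service = PayoutService(
    storage = InMemoryStorage(),
    validators = listOf(
        UserIdValidator(),
        AmountValidator(configuration),
        CurrencyValidator(configuration) // hypothetical, built like the validators above
    )
)

service.process(Payout("user123", 25.0, "EUR"))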

Proper Error Handling

kotlin
enum class PayoutErrorCode {
    EMPTY_USER_ID,
    INVALID_AMOUNT,
    INVALID_CURRENCY,
    USER_LIMIT_EXCEEDED
}

class InvalidPayoutException(
    val code: PayoutErrorCode,
    message: String
) : Exception(message)
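
At the call site, the error code lets callers handle failures gracefully, per the original requirement. A sketch of one possible shape (the surrounding API is my assumption):

kotlin
// Hedged sketch: graceful handling at the boundary of the service
fun submit(payout: Payout): Result<Unit> =
    try {
        payoutService.process(payout)
        Result.success(Unit)
    } catch (e: InvalidPayoutException) {
        // Callers can branch on e.code instead of parsing messages
        Result.failure(e)
    }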

Clean Tests

kotlin
@ExtendWith(MockKExtension::class)
class AmountValidatorTest {
    
    @InjectMockKs
    private lateinit var validator: AmountValidator
    
    @MockK
    private lateinit var configuration: PayoutConfiguration
    
    @ParameterizedTest
    @ValueSource(doubles = [0.0, -5.0, -100.0])
    fun `should throw exception when amount is zero or negative`(amount: Double) {
        val payout = PayoutMother.of(amount = amount)
        
        val exception = shouldThrow<InvalidPayoutException> {
            validator.validate(payout)
        }
        exception.code shouldBe PayoutErrorCode.INVALID_AMOUNT
    }
}

Key Discoveries

✅ What AI Does Well

  • Fast implementation once architecture is defined
  • Comprehensive test case generation (almost too comprehensive)
  • Pattern recognition - can apply consistent patterns across similar classes
  • Refactoring assistance - good at mechanical code improvements

⚠️ What Needs Heavy Human Oversight

  • Architectural decisions - defaults to simplest (often wrong) approach
  • Separation of concerns - mixes responsibilities without prompting
  • Test strategy - over-tests simple scenarios, under-tests complex ones
  • Dependency management - avoids mocking, uses real dependencies

❌ What AI Struggles With

  • TDD discipline - wants to write everything at once
  • Minimal implementations - jumps to complete solutions immediately
  • Context management - loses track of current focus with complex requirements
  • Quality assessment - confident about objectively poor code

The Scalability Problem

The most concerning discovery: AI-led TDD doesn't scale with complexity.

  • Calculator (10 lines): Excellent TDD discipline
  • Payout Service (200+ lines): Required constant human intervention
  • Real application (1000+ lines): Would be unmanageable

AI seems to have a complexity threshold where its behavior changes fundamentally.

Lessons for VibeTDD

1. Classic TDD Must Be Adapted for AI

The one-test-at-a-time approach is incompatible with AI collaboration. VibeTDD Principle: Write small, focused sets of tests first, then implement together. This reduces context overhead while maintaining test-first discipline.

2. AI Amplifies Your Approach

If you don't provide structure and conventions, AI will create its own - and they won't be good.

3. Micro-Management is Required

With complex requirements, you need to break work into tiny, discrete chunks. AI can't maintain context across large feature implementations.

4. Architecture Must Be Human-Led

AI defaults to the simplest possible structure, which is rarely the right structure for maintainable software.

5. Testing Strategy Needs Curation

AI generates exhaustive tests rather than strategic tests. It doesn't understand the difference between essential coverage and paranoid over-testing.

The Verdict

VibeTDD Phase 2 was a humbling experience. While AI can certainly generate code that passes tests, it cannot maintain the discipline and architectural thinking that makes TDD valuable.

The real insight: TDD's value isn't just about having tests; it's about the thinking process that creates good design. AI can execute TDD mechanics but can't do TDD thinking.

Next: The Role Reversal

For Phase 3, I'm flipping the script completely. Instead of letting Claude lead, I'll drive the TDD process myself and use AI as an implementation assistant.

The hypothesis: If humans provide the discipline and architecture through test design, AI might be an excellent implementation partner.

Will Claude write better code when constrained by human-designed tests? Can TDD serve as quality guardrails for AI-generated code? Let's find out.


This experiment revealed the boundaries of AI-led development more clearly than I expected. The next phase will test whether human-led TDD can harness AI's speed while maintaining quality. Follow along with the VibeTDD roadmap to see how this evolves.

Code Repository

The complete code from this experiment is available at: VibeTDD Phase 2 Repository

Built by a software engineer for engineers )))