A few months ago I spotted a (critical) software bug that could cause data loss in one of my projects.

The affected code is a background job that cleans up old data according to the data retention period set by each user in the application settings.

Here’s the unpatched C# code (the containing class is not shown here):

public async Task RemoveOldPiiData(RemoveOldPiiDataCommand request, CancellationToken cancellationToken)
{
    List<UserEntity> users = await db.Users.ToListAsync(cancellationToken);

    foreach (var user in users)
    {
        int months = user.PiiRetentionMonths;

        if (months < 1)
        {
            logger.LogError("User {UserId} has a suspect PII retention period of {Months} months. Skipping",
                user.Id.Value, months);
            continue;
        }

        var sw = Stopwatch.StartNew();

        Instant retentionDate = SystemClock.Instance.GetCurrentInstant().Minus(Duration.FromDays(months * 30));

        await db.Shipments
            .Where(x => x.ArchivedAt < retentionDate)
            .ExecuteUpdateAsync(update => update
                    .SetProperty(x => x.RecipientName, x => null)
                    .SetProperty(x => x.RecipientEmailAddress, x => null),
                cancellationToken);

        logger.LogInformation(
            "Cleaned PII for user {UserId} with retention of {Months} months in {ElapsedMilliseconds:0}ms",
            user.Id.Value, months, sw.Elapsed.TotalMilliseconds);
    }
}

This method loops through all the users in the database and cleans the data belonging to each user.

Or at least it should… Since I forgot to filter by user ID in the cleanup query, data belonging to one user might be removed based on the retention period of another.

Oops! In hindsight, this is a very obvious bug: there's a loop going through a list of users, but the current user is never used in any meaningful way inside the query, which isn't hard to notice when you read the code.
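
For reference, here's a minimal sketch of the kind of fix (assuming the shipment entity exposes a UserId property matching user.Id; the entity classes aren't shown here, so that name is a guess):

await db.Shipments
    // Only touch the shipments that belong to the user whose retention period we're applying
    .Where(x => x.UserId == user.Id && x.ArchivedAt < retentionDate)
    .ExecuteUpdateAsync(update => update
            .SetProperty(x => x.RecipientName, x => null)
            .SetProperty(x => x.RecipientEmailAddress, x => null),
        cancellationToken);

The only change is the x.UserId == user.Id condition; the rest of the query stays the same.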

I’ve asked several LLMs to list the bugs and issues they see in this code and, surprisingly, most of the models are unable to detect this bug (or don't detect it consistently) without follow-up questions.

| LLM | Tested on | Reasoning | Bug found |
| --- | --- | --- | --- |
| GPT-4o | ChatGPT | | no |
| GPT-4o-mini | Kagi Assistant | | no |
| o3-mini | ChatGPT | medium? | yes |
| Claude 3.5 Sonnet | Claude.ai | | ⚠️ (sometimes) |
| Claude 3.7 Sonnet | Claude.ai | | yes (see update) |
| Claude 3.5 Haiku | Kagi Assistant | | no |
| Claude 3 Opus | Kagi Assistant | | no |
| Gemini 2.0 Flash | Gemini | | ⚠️ (sometimes) |
| Gemini 2.0 Flash Thinking | Gemini experimental | yes | ⚠️ (sometimes) |
| DeepSeek V3 | DeepSeek | | no |
| DeepSeek R1 | DeepSeek | yes | yes |
| DeepSeek R1 Distill Llama 70B | Kagi Assistant | yes | yes |
| Grok 3 | X | | no |
| Grok 3 Thinking | X beta | yes | ⚠️ (sometimes) |
| Pixtral Large | Kagi Assistant | | no |
| Mistral Large | Kagi Assistant | | no |

Not surprisingly, reasoning models are the best at this task.

Most of the other models aren't: they usually list very generic suggestions instead, like improving the logging, adding more error handling, or noting that months aren't exactly 30 days long on average, along with some creative theories about potential deadlocks.
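
For what it's worth, the month-length remark is technically true: the code above approximates a month as 30 days. A calendar-aware cutoff with NodaTime could look roughly like this (just a sketch, assuming a UTC-based cutoff is acceptable):

// Subtract whole calendar months instead of months * 30 days; using UTC here is an assumption
LocalDate today = SystemClock.Instance.GetCurrentInstant().InUtc().Date;
Instant retentionDate = today.PlusMonths(-months).AtStartOfDayInZone(DateTimeZone.Utc).ToInstant();

That's a nitpick, though, not the bug that matters here.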

It may be that the LLMs are missing some context here (like the full structure of the involved entity classes), or that the prompt isn't detailed enough, but I still think they should at least hint that something is wrong in that query. This is a major and evident bug with security implications.

LLMs are of course not deterministic, so different runs sometimes lead to better (or worse) results:

  • GPT-4o seems to always fail, even when the initial prompt suggests looking for security issues, while o3-mini always detects the issue and lists it first (even with a rough prompt).
  • Claude 3.5 Sonnet sometimes gives a short answer highlighting just the critical bug, and sometimes produces long answers with lots of talk and no actual bugs.
  • In this Gemini 2.0 Flash chat, the model spots the issue and puts it in second place, while here (slightly different prompt) it doesn't see the issue at all.
  • In one Flash Thinking execution the model listed some bugs with a supposed severity of “high” without including the actual bug, while in other executions it found the issue and correctly classified it as major. (Chats with reasoning can’t be shared.)
  • DeepSeek V3 lists a weird “bug”: the word “Skipping” should be spelled “Skipping” instead of “Skipping” (they’re identical, yeah…).
  • DeepSeek R1’s reasoning may be the longest among these models, but it then produces a very concise answer, always putting the «critical bug» in first place.
  • Grok 3 Thinking sometimes highlights the bug as critical, and sometimes doesn’t catch it at all.
  • All the other models always fail.

UPDATE: Claude 3.7 Sonnet was released one day after this article was published, and it seems to always spot the issue.

In my limited testing, if you then ask follow-up questions or request a second review, the LLM usually spots the issue.

Feel free to run your own tests with the above code and let me know how it goes!