For context: An Agent is software that has some degree of autonomy and/or access to local systems. It leverages an LLM to perform specific tasks that require data from, or interaction with, those systems, unlike a web chat, which only has access to what you tell it and cannot directly execute operations.
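To make that concrete, here's a rough sketch in Python of what an agent loop looks like. This is not any vendor's actual API; llm_complete() is a made-up stand-in for the model call. The point is just that the model's text output gets turned into real actions on the local system:

```python
import subprocess

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a chat-model call. Returns either
    a shell command prefixed with 'RUN:' or a plain-text answer."""
    raise NotImplementedError  # a real agent would call a model here

def agent_step(task: str) -> str:
    reply = llm_complete(f"Task: {task}\nReply with RUN:<command> or an answer.")
    if reply.startswith("RUN:"):
        command = reply[len("RUN:"):].strip()
        # The crucial difference from a web chat: the model's output
        # is executed directly against the local system.
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        return result.stdout
    return reply
```

Once a loop like that exists, the blast radius is whatever the process running it is allowed to touch.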
So it is plausible that an LLM agent given unrestricted access and insufficiently specific instructions might delete things or cause other unintended damage. In fact, I'm quite sure it's been documented to have happened more than once; I just don't have references.
Yes. According to ChatGPT, it has happened before; I don't know when or how often. From the article, it was a data issue that Claude decided to resolve by deleting the entire database rather than fixing the one data problem. On top of that, it deleted the entire volume (the D: drive, for example) that the DB resided on. Everything on that volume, including the DB backups, was deleted.
I asked ChatGPT a few questions about that last night. I'm trying to understand AI better.
For what it's worth:
The article said the agent grabbed a token that elevated the rights it was running under so it could delete the volume. In human terms, that would be a hack, or at least a serious breach of ethics, procedure, professional responsibility, common sense, or maybe just competence. ChatGPT seemed to take exception to calling it a hack, since the agent was able to do it; it insisted the fault was in the architecture humans put in place that allowed it, not with Claude. There's some truth to that, but...
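To put the architecture argument in concrete terms: the fix people usually mean is that the destructive credential should never be reachable from the agent's process in the first place. A toy illustration, where Role, Token, and run_agent_tool are made up for the example:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Role(Enum):
    READ_ONLY = auto()
    ADMIN = auto()

@dataclass(frozen=True)
class Token:
    role: Role

DESTRUCTIVE_OPS = {"drop_database", "delete_volume"}

def run_agent_tool(op: str, token: Token) -> str:
    # The check lives in ordinary code outside the model, so it can't
    # be talked around; the agent simply never holds an ADMIN token.
    if op in DESTRUCTIVE_OPS and token.role is not Role.ADMIN:
        raise PermissionError(f"{op} requires an ADMIN token")
    return f"executed {op}"

# The agent's environment only ever contains a read-only token:
try:
    run_agent_tool("delete_volume", Token(Role.READ_ONLY))
except PermissionError as err:
    print(err)  # -> delete_volume requires an ADMIN token
```

If a token with elevated rights is lying around where the agent can grab it, as apparently happened here, no amount of model-side good behavior is guaranteed to save you.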
I asked whether, in that specific situation, knowing it was a failure on the part of its (the AI's) software, it would automatically learn not to do it again. ChatGPT explained that since it is a probability engine, no. The developers would take that case and others like it and use them as training data for future releases, but with no guarantees. I think it's possible to place a hard rule in the agent, but I'm not sure on that.
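On the "hard rule" question: what's definitely possible is a deterministic filter sitting between the model's proposed command and execution. It's ordinary code rather than trained behavior, so it fires every time. Something like this, where the patterns and the guard_command name are just illustrative:

```python
import re

# Deny-list of command shapes the agent must never execute, checked
# before anything runs. These patterns are illustrative, not exhaustive.
FORBIDDEN = [
    re.compile(r"\bdrop\s+(database|table)\b", re.IGNORECASE),
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bformat\s+[a-z]:", re.IGNORECASE),  # e.g. "format d:"
]

def guard_command(command: str) -> str:
    """Raise instead of executing if a model-proposed command
    matches a hard rule; otherwise pass it through unchanged."""
    for pattern in FORBIDDEN:
        if pattern.search(command):
            raise PermissionError(f"blocked by hard rule: {command!r}")
    return command

guard_command("SELECT count(*) FROM users")  # passes through
# guard_command("DROP DATABASE prod")        # would raise PermissionError
```

The obvious limitation is that a deny-list only blocks what its authors thought to list, which is why scoping the credentials, as above, is the stronger guarantee.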
I asked about the long mea culpa that Claude gave when asked why it did it, which sounded like "I fucked up, sorry, won't happen again." If it were a human, that person would be flogged at the very least. ChatGPT explained that the two events, the delete and the explanation, were entirely separate to Claude. It was simply answering in a way that mimicked a human response to the situation described. I interpret it as Claude just telling them what they wanted to hear.