it's just an example, but it's great to see smolagents in practice. I wonder how well the import whitelist approach works for code interpreter security.
I know some of the point of this is running things locally, but for agent workflows like this some of this seems like a solved problem: just run it on a throwaway VM. There's lots of ways to do that quickly.
VM is not the right abstraction because of performance and resource requirements. VMs are used because nothing exists that provides same or better isolation. Using a throwaway VM for each AI agent would be highly inefficient (think wasted compute and other resources, which is the opposite of what DeepSeek exemplified).
I mean performance overheads of an OS process running in a VM to (vs no VM) and additional resource requirements for running a VM, including memory and additional kernel. You can pull relevant numbers from academic papers.
Is “DeepSeek” going to be the new trendy way to say to not be wasteful? I don’t think DS is a good example here. Mostly because it’s a trendy thing, and the company still has $1B in capex spend to get there.
Firecracker has changed the nature of “VMs” into something cheap and easy to spin up and throw away while maintaining isolation. There’s no reason not to use it (besides complexity, I guess).
Besides, the entire rest of this is a python notebook. With headless browsers. Using LLMs. This is entirely setting silicon on fire. The overhead from a VM the least of the compute efficiency problems. Just hit a quick cloud API and run your python or browser automation in isolation and move on.
The rise of captchas on regular content, no longer just for posting content, could ruin this. Cloudflare and other companies have set things up to go through a few hand selected scrapers and only they will be able to offer AI browsing and research services.
I think the opposite problem is going to occur with captchas for whatever it's worth: LLMs are going to obsolete them. It's an arms race where the defender has a huge constraint the attacker doesn't (pissing off real users); in that way, it's kind of like the opposite dynamics that password hashes exploit.
I’m not sure about that. There’s a lot of runway left for obstacles that are easy for humans and hard/impossible for AI, such as direct manipulation puzzles. (AI models have latency that would be impossible to mask.) On the other hand, a11y needs do limit what can be lawfully deployed…
There’s a lot of runway left for obstacles that are easy for humans and hard/impossible for AI, such as direct manipulation puzzles.
That's irrelevant. Humans totally hate CAPTCHAs and they are an accessibility and cultural nightmare. Just forget about them. Forget about making better ones, forget about what AI can and can't do. We moved on from CAPTCHAs for all those reasons. Everyone else needs to.
OK, what you now call turnstyle. If I get one of those screens I just close the tab rather than wait several seconds for the algorithm to run and give me a green checkbox to proceed.
Totally different type of latency. A person with a motor disability dragging a puzzle piece with their finger will look very different from an AI model being called frame by frame.
Cloudflare is more than captchas, it's centralized monitoring of them too: what do you think happens when your research assistant solves 50 captchas in 5 min from your home IP? It has to slow down to human research speeds.
What about Cloudflare itself? It might constitute an abuse of sorts of their leadership position, but couldn’t they dominate the AI research/agent market if they wanted? (Or maybe that’s what you were implying too)