Show HN: Hyperbrowser MCP Server – Connect AI agents to the web through browsers (github.com/hyperbrowserai)
63 points by shrisukhani 50 days ago | 26 comments
Hi HN! Excited to share our MCP Server at Hyperbrowser - something we’ve been working on for a few days. We think it’s a pretty neat way to connect LLMs and IDEs like Cursor / Windsurf to the internet.

Our MCP server exposes seven tools for data collection and browsing:

1. `scrape_webpage` - Extract formatted content (markdown, screenshots, etc.) from any webpage

2. `crawl_webpages` - Navigate through multiple linked pages and extract LLM-friendly formatted content

3. `extract_structured_data` - Convert messy HTML into structured JSON

4. `search_with_bing` - Query the web and get results with Bing search

5. `browser_use_agent` - Fast, lightweight browser automation with the Browser Use agent

6. `openai_computer_use_agent` - General-purpose automation using OpenAI’s CUA model

7. `claude_computer_use_agent` - Complex browser tasks using Claude computer use

You can connect the server to Cursor, Windsurf, Claude desktop, and any other MCP clients with this command `npx -y hyperbrowser-mcp` and a Hyperbrowser API key. We're running this on our cloud browser infrastructure that we've been developing for the past few months – it handles captchas, proxies, and stealth browsing automatically.
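For reference, a typical MCP client config (Claude Desktop's `claude_desktop_config.json` or Cursor's MCP settings) looks roughly like this - the env variable name below is illustrative, so check the repo README for the exact key the server expects:

```json
{
  "mcpServers": {
    "hyperbrowser": {
      "command": "npx",
      "args": ["-y", "hyperbrowser-mcp"],
      "env": {
        "HYPERBROWSER_API_KEY": "<your-api-key>"
      }
    }
  }
}
```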

Some fun things you can do with it: (1) deep research with Claude Desktop, (2) summarizing the latest HN posts, (3) creating full applications from short gists in Cursor, (4) automating code review in Cursor, (5) generating llms.txt for any website with Windsurf, (6) ordering sushi from Windsurf (admittedly, this last one is just for fun - probably not actually going to do this myself).

We're building this server in the open and would love feedback from anyone building agents or working with web automation. If you find bugs or have feature requests, please let us know! One big issue with MCPs in general is that the installation UX sucks and auth credentials have to be hardcoded. We don't have a solution to this right now, but Anthropic seems to be working on something here, so we're excited for that to come out. We'd love to hear any other complaints/thoughts you have about the server itself, Hyperbrowser, or the installation experience.

You can check us out at https://hyperbrowser.ai or check out the source code at https://github.com/hyperbrowserai/mcp




Is there support for robots.txt so service operators can opt out of your mass scraping?


Not only do they not respect robots.txt, but they publish an entire page[1] in their docs dedicated to circumventing scraping countermeasures.

I pointed their scraper at a URL on my server to test its behavior. It made four separate requests to the same page, three with the UA "undici"[2], and one with the UA "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36" and a different IP. The first three requests were all made within the span of 1 second, and the fourth 27 seconds later.

I emailed their published support address asking for an IP range and UA. They gave me the entire IP range of Google Cloud, and ignored the UA question.

This goes well beyond the "it's up to our users to implement responsible scraping practices" implication from the developer's other comment[3]. Instead, their service behaves maliciously by default, and they have implemented and documented switches that users can toggle for additional malicious scraping behavior. As far as I can tell, it is not even possible to implement a robots.txt-respecting scraper on top of this, because I couldn't find any mechanism for users to set a specific UA string.

[1]: https://docs.hyperbrowser.ai/sessions/advanced-privacy-and-a... (archived: https://web.archive.org/web/20250322045952/https://docs.hype..., http://archive.today/2025.03.22-050029/https://docs.hyperbro...)

[2]: https://github.com/nodejs/undici

[3]: https://news.ycombinator.com/item?id=43442116


Nice work.

I feel like there's a lot to unpack here, and still much to discuss in the broader context.

There are a few things that can be excused or at least reasonably argued. Multiple IPs make some sense since requests are being made through random proxies, and I don't think that on its own demonstrates an intention of bad behavior. If it did, I think you would've seen all four requests coming from different IPs and with different user agents, all posing as legitimate user browser sessions.

But the rest is inexcusable. Providing documentation on how to circumvent countermeasures without even acknowledging, or presenting a viewpoint on, the concern many have about the potential malicious use of these tools. Taking the time to respond to your inquiry about the IP range and UA but giving an answer that is somewhere between intentionally incomplete and intentionally misleading. (Is the actual IP range even within the large Google IP range, given the source of the proxies as mentioned elsewhere in this thread?) Just very poor decisions on how to handle the extremely predictable resistance they were very obviously going to encounter.


> Multiple IPs makes some sense since requests are being made through random proxies, and I don't think on its own demonstrates intention of bad behavior.

Agreed, just using multiple IPs isn't malicious on its own. I thought it was notable in the context of issuing another request with a generic browser UA. It's possible that the IP change was a deliberate strategy to avoid detection (like changing the UA likely is), but also possible that it was just a side effect of their infrastructure design, or a combination of the two.

> without even acknowledging, or presenting a viewpoint on, the concern many have with the potential malicious use of these tools.

So, amusingly, they seem to have added an "ethical scraping" page[1] to their docs in between me looking at this a few hours ago and now. (You can see that this page is missing from the sidebar in my archive link from earlier.) I particularly enjoy the parts where they say "follow robots.txt rules" and "limit RPS on one site", because as far as I can tell it is actually not possible to do either of these things as a user of this service. There is no mechanism (that I could find) to set an identifiable user agent on the scraper client, nor a mechanism to control the delay between crawling different pages. It's not impossible that they have implemented a reasonable rate limit, with proper backoff when it appears the target is under load, but I wouldn't bet on it.
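For contrast, here's a deliberately naive sketch of the two things their own "ethical scraping" page asks for - an identifiable user agent and a robots.txt check - when you control the HTTP client yourself (undici here; the UA string and delay are placeholders, and a real crawler would use a proper robots.txt parser and per-host backoff):

```typescript
// Naive sketch: identifiable UA + robots.txt check + a crude delay.
import { request } from "undici";

const UA = "example-crawler/1.0 (+https://example.com/bot-info)"; // placeholder, identifiable UA

async function allowedByRobots(url: URL): Promise<boolean> {
  const res = await request(new URL("/robots.txt", url.origin), {
    headers: { "user-agent": UA },
  });
  if (res.statusCode !== 200) return true; // no robots.txt published
  const body = await res.body.text();
  // Only honors "Disallow:" lines in the "User-agent: *" group.
  let inStarGroup = false;
  for (const raw of body.split("\n")) {
    const line = raw.split("#")[0].trim().toLowerCase();
    if (line.startsWith("user-agent:")) {
      inStarGroup = line.slice("user-agent:".length).trim() === "*";
    } else if (inStarGroup && line.startsWith("disallow:")) {
      const path = line.slice("disallow:".length).trim();
      if (path && url.pathname.startsWith(path)) return false;
    }
  }
  return true;
}

async function politeFetch(target: string): Promise<string | null> {
  const url = new URL(target);
  if (!(await allowedByRobots(url))) return null;  // site opted out
  await new Promise((r) => setTimeout(r, 2000));   // crude rate limit
  const res = await request(url, { headers: { "user-agent": UA } });
  return res.body.text();
}
```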

> Is the actual IP range even within the large Google IP range given the source of the proxies as mentioned elsewhere in this thread?

Good question! I am not able to test this, because they don't expose the proxies without paying them money, which I do not intend to do. My guess would be "no".

> still much to discuss in the broader context.

Yeah. The broader context is that most LLM scrapers are a plague, and I cannot wait for this bubble to pop. Until recently, we were getting upwards of 20k requests per day from individual LLM scrapers on a gitlab instance that I admin. The combination of malice and staggering incompetence with which these are operated is incredible. I have observed fun tactics like "switch to a generic UA and increase the crawling rate after being added to robots.txt" from the same scraper that isn't smart enough to realize that it doesn't need to crawl the same commit hash multiple times per hour. The bit that tells you not to get stuck crawling the CI pipeline results forever is there to protect you, silly.

Things are reportedly much worse[2] for admins of larger services. I saw this referred to as a "DDOS of the entire internet" a while ago, which is pretty accurate.

What we ended up doing is setting up an infinite maze of markov chain nonsense text that we serve to LLM scrapers at a few bytes per second. All they have to do to avoid it is respect robots.txt. I recommend this! It's fun and effective, and if we're lucky, it might cause harm to some people and systems that deserve it.
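The shape of it, for anyone curious, is roughly this (a toy sketch: the word list stands in for a real markov generator, and the path, port, and timings are placeholders; your robots.txt should Disallow the maze path so well-behaved crawlers never see it):

```typescript
// Toy sketch: anything crawling under /maze/ gets endless generated text,
// drip-fed slowly, with links that lead deeper into the maze.
import { createServer } from "node:http";

const WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]; // stand-in for a markov model
const babble = () =>
  Array.from({ length: 8 }, () => WORDS[Math.floor(Math.random() * WORDS.length)]).join(" ");

createServer((req, res) => {
  if (!req.url?.startsWith("/maze/")) {
    res.writeHead(404).end();
    return;
  }
  res.writeHead(200, { "content-type": "text/html" });
  const drip = setInterval(() => {
    // One nonsense sentence and one link deeper, a few bytes at a time.
    const next = Math.random().toString(36).slice(2);
    res.write(`<p>${babble()} <a href="/maze/${next}">more</a></p>\n`);
  }, 3000);
  req.on("close", () => clearInterval(drip));
}).listen(8080);
```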

[1]: https://web.archive.org/web/20250322072210/https://docs.hype... [2]: https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/


> The broader context is that most LLM scrapers are a plague, and I cannot wait for this bubble to pop.

My thing is, there are legitimate uses for automated browsing, uses that could be extremely useful and yet nondamaging to (or even supported by!) site operators. But we'll never get to have them if the tools/methods to implement them are the same as the ones used by people inadvertently DDOSing the site they're trying to inhale the entire contents of. For them to not get lumped together, purveyors of the tools cannot remain "neutral" or hide implementation details or condone, whether explicit or implicit, bad behavior. We've seen this happen on the web before, and we're already seeing desperate organizations implement nuclear-option LLM scraper blockers that also take out things like RSS readers. Anyways... I may just have to write something about this...

> So, amusingly, they seem to have added a "ethical scraping" page[1] to their docs in between me looking at this a few hours ago and now.

Congrats, you made an impact :)


Would you like to explain how directing your user agent to use the internet just as you would in order to complete a task or solve a problem is "mass scraping"?


> to use the internet just as you would

Your premise is flawed, or at the very least, far from guaranteed. One could use tools like this to browse the internet only as they would otherwise. One could also use them for mass scraping, and many do. If you've looked at HN on previous days this week, there's been a front-page story nearly every day about problems resulting from exactly that.

The parent comment was perhaps a bit too snarky with the assumption that this could/would only be used for malicious behavior on a large scale, but your assumption that it wouldn't be is not any better. It also runs contrary to what people have been experiencing and discussing in this arena in recent days.


Intent.


Care to elaborate? That is not a substantial argument.


No, we don't enforce any robots.txt restrictions ourselves. We also don't do any scraping ourselves. We provide browser infrastructure that operates like any normal browser would - what users choose to do with it is up to them. We're building tools that give AI agents the same web access capabilities that humans have, and we don't think it's our place to impose any additional limitations.


It is 100% your responsibility what your servers do to other peoples servers in this context, and wanton negligence is not an excuse that will stop your servers from being evicted by hosting companies.


"You make the tools; what people do with them isn't up to you." I can tolerate some form of that opinion/argument on some level, but it is at the very least short-sighted on your part not to have been better equipped to respond to the concerns people have about potential misuse.

If what has been said elsewhere in this thread is true about providing documentation on how to circumvent attempts to detect/block your service and your resistance to providing helpful information such as IP ranges used and how user agents are set, then you have strayed far from being neutral and hands-off.

"it's not our place" is not actual neutrality, it's performative or complicit neutrality. Actual neutrality would be perhaps not providing ways to counter your service, but also not documenting how to circumvent people from trying. And if this is what your POV is, fine! You are far from alone--given the state of the automated browsing/scraping ecosystem right now, plenty of people feel this way. Be honest about it! Don't deflect questions. Don't give misleading answers/information. That's what carries this into sketchy territory


Do you publish an IP range?


This looks cool.

1) I looked at the pricing. Is search included in the price (you just pay for credits/browser time)?

2) Can you tell me more about the source of your residential proxies? I am new to this space, so don’t know how people source these legitimately.

Thanks!


Thanks!

1) Yep, you just pay for browser time and proxy usage

2) We use a handful of proxy providers under the hood ourselves. There are a lot of shady ones, but we only work with providers whose sources we've vetted. Different providers source proxies in different ways - directly from ISPs, paying the end sources for proxies, etc.


What is your vetting process and who have you vetted?


Fantastic.

MCPs are showing promise.


Thanks!

And yeah MCP is super promising. We announced this on X and LinkedIn yesterday and the response has been really good. A lot of people with a bunch of use cases.

One surprising thing is there’s also a bunch of semi/non-technical people using our MCP server and the installation experience for them rn just absolutely sucks.

I think once auth and 1-click install are solved, MCP could become the standard way to integrate tools with LLMs


The MCP protocol is really not that good. The stateful nature of it makes it only suitable as a local desktop RPC pipe - certainly not something that will work well on mobile, nor anything anyone would want to run and maintain in a server-to-server context.

It is fine if that is the scope. It is also understandable why Anthropic chose a stateful protocol where stateless HTTP would be more than enough: they are catering to the default transport layer, which is stdio-based and where state needs to be established.

There are also other aspects of it that are simply unnecessarily complex and resource-intensive for no good reason.
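To make the statefulness concrete: before a client can call a single tool, it has to run an initialize handshake and keep that session open for everything that follows - roughly this sequence over stdio (method names as I remember them from the spec; names and values are placeholders):

```
--> {"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"my-client","version":"0.1.0"}}}
<-- {"jsonrpc":"2.0","id":1,"result":{"protocolVersion":"2024-11-05","capabilities":{"tools":{}},"serverInfo":{"name":"some-server","version":"0.1.0"}}}
--> {"jsonrpc":"2.0","method":"notifications/initialized"}
--> {"jsonrpc":"2.0","id":2,"method":"tools/list"}
--> {"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"scrape_webpage","arguments":{"url":"https://example.com"}}}
```

Drop the connection and you start over; compare that with a stateless HTTP call where every request carries everything it needs.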


What do you suggest instead of MCP?


Well, OpenAPI. You don't need some weird debugging tools nobody knows how to use, a stateful protocol that is hard to troubleshoot, etc. There is plenty of support already built into standard HTTP services and Swagger - an abundance of tools and documentation too - and what we call function calling is basically JSON Schema, which is at the core of Swagger definitions.

MCP is trying to reinvent OpenAPI but in the wrong way.
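To make the overlap concrete, here is a rough sketch of an OpenAPI operation (not any particular vendor's schema; the path and fields are illustrative). The inner `schema` object is exactly the kind of JSON Schema you would otherwise hand a model as a tool's parameters:

```json
{
  "paths": {
    "/scrape": {
      "post": {
        "operationId": "scrape_webpage",
        "summary": "Extract content from a webpage",
        "requestBody": {
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "properties": { "url": { "type": "string" } },
                "required": ["url"]
              }
            }
          }
        }
      }
    }
  }
}
```

A thin adapter over an existing OpenAPI document could expose every documented operation as a tool without inventing a new protocol.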


Doesn't seem like a good equivalent.


Why not? It is easy and widely available. It has support from many development tools, from browsers to REST clients, intercepting proxies, and more.

I guess it is not exciting because nobody can put their name on it.


++ love that folks are trying to build companies on MCP. Good luck!


Thanks! :)


dope



