Show HN: Hyperbrowser MCP Server – Connect AI agents to the web through browsers (github.com/hyperbrowserai)
63 points by shrisukhani 50 days ago | 26 comments
Hi HN! Excited to share our MCP Server at Hyperbrowser - something we’ve been working on for a few days. We think it’s a pretty neat way to connect LLMs and IDEs like Cursor / Windsurf to the internet.

Our MCP server exposes seven tools for data collection and browsing:

1. `scrape_webpage` - Extract formatted content (markdown, screenshots, etc.) from any webpage

2. `crawl_webpages` - Navigate through multiple linked pages and extract LLM-friendly formatted content

3. `extract_structured_data` - Convert messy HTML into structured JSON

4. `search_with_bing` - Query the web and get results with Bing search

5. `browser_use_agent` - Fast, lightweight browser automation with the Browser Use agent

6. `openai_computer_use_agent` - General-purpose automation using OpenAI’s CUA model

7. `claude_computer_use_agent` - Complex browser tasks using Claude computer use

You can connect the server to Cursor, Windsurf, Claude desktop, and any other MCP clients with this command `npx -y hyperbrowser-mcp` and a Hyperbrowser API key. We're running this on our cloud browser infrastructure that we've been developing for the past few months – it handles captchas, proxies, and stealth browsing automatically.
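For reference, a typical MCP client config (Claude Desktop's `claude_desktop_config.json` or Cursor's MCP settings) looks roughly like this - the env variable name below is illustrative, so check the repo README for the exact key the server expects:

```json
{
  "mcpServers": {
    "hyperbrowser": {
      "command": "npx",
      "args": ["-y", "hyperbrowser-mcp"],
      "env": {
        "HYPERBROWSER_API_KEY": "<your-api-key>"
      }
    }
  }
}
```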

Some fun things you can do with it: (1) deep research with Claude Desktop, (2) summarizing the latest HN posts, (3) creating full applications from short gists in Cursor, (4) automating code review in Cursor, (5) generating llms.txt for any website with Windsurf, (6) ordering sushi from Windsurf (admittedly, this last one is just for fun - probably not actually going to do this myself).

We're building this server in the open and would love feedback from anyone building agents or working with web automation. If you find bugs or have feature requests, please let us know! One big issue with MCPs in general is that the installation UX sucks and auth credentials have to be hardcoded. We don't have a solution to this right now, but Anthropic seems to be working on something here, so we're excited for that to come out. We'd love to hear any other complaints/thoughts you have about the server itself, Hyperbrowser, or the installation experience.

You can check us out at https://hyperbrowser.ai or check out the source code at https://github.com/hyperbrowserai/mcp




Is there support for robots.txt so service operators can opt out of your mass scraping?


Not only do they not respect robots.txt, but they publish an entire page[1] in their docs dedicated to circumventing scraping countermeasures.

I pointed their scraper at a URL on my server to test its behavior. It made four separate requests to the same page, three with the UA "undici"[2], and one with the UA "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36" and a different IP. The first three requests were all made within the span of 1 second, and the fourth 27 seconds later.

I emailed their published support address asking for an IP range and UA. They gave me the entire IP range of Google Cloud, and ignored the UA question.

This goes well beyond the "it's up to our users to implement responsible scraping practices" implication from the developer's other comment[3]. Instead, their service behaves maliciously by default, and they have implemented and documented switches that users can toggle for additional malicious scraping behavior. As far as I can tell, it is not even possible to implement a robots.txt-respecting scraper on top of this, because I couldn't find any mechanism for users to set a specific UA string.

[1]: https://docs.hyperbrowser.ai/sessions/advanced-privacy-and-a... (archived: https://web.archive.org/web/20250322045952/https://docs.hype..., http://archive.today/2025.03.22-050029/https://docs.hyperbro...)

[2]: https://github.com/nodejs/undici

[3]: https://news.ycombinator.com/item?id=43442116


Nice work.

I feel like there's a lot to unpack here, and still much to discuss in the broader context.

There are a few things that can be excused or at least reasonably argued. Multiple IPs make some sense since requests are being made through random proxies, and I don't think that on its own demonstrates an intention of bad behavior. If it did, I think you would've seen all four requests coming from different IPs and with different user agents, all posing as legitimate user browser sessions.

But the rest is inexcusable. Providing documentation on how to circumvent countermeasures without even acknowledging, or presenting a viewpoint on, the concern many have about the potential malicious use of these tools. Taking the time to respond to your inquiry about the IP range and UA but giving an answer that is somewhere between intentionally incomplete and intentionally misleading. (Is the actual IP range even within the large Google IP range, given the source of the proxies as mentioned elsewhere in this thread?) Just very poor decisions on how to handle the extremely predictable resistance they were very obviously going to encounter.


> Multiple IPs makes some sense since requests are being made through random proxies, and I don't think on its own demonstrates intention of bad behavior.

Agreed, just using multiple IPs isn't malicious on its own. I thought it was notable in the context of issuing another request with a generic browser UA. It's possible that the IP change was a deliberate strategy to avoid detection (like changing the UA likely is), but also possible that it was just a side effect of their infrastructure design, or a combination of the two.

> without even acknowledging, or presenting a viewpoint on, the concern many have with the potential malicious use of these tools.

So, amusingly, they seem to have added an "ethical scraping" page[1] to their docs in between me looking at this a few hours ago and now. (You can see that this page is missing from the sidebar in my archive link from earlier.) I particularly enjoy the parts where they say "follow robots.txt rules" and "limit RPS on one site", because as far as I can tell it is actually not possible to do either of these things as a user of this service. There is no mechanism (that I could find) to set an identifiable user agent on the scraper client, nor a mechanism to control the delay between crawling different pages. It's not impossible that they have implemented a reasonable rate limit, with proper backoff when it appears the target is under load, but I wouldn't bet on it.
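For contrast, here's a deliberately naive sketch of the two things their own "ethical scraping" page asks for - an identifiable user agent and a robots.txt check - when you control the HTTP client yourself (undici here; the UA string and delay are placeholders, and a real crawler would use a proper robots.txt parser and per-host backoff):

```typescript
// Naive sketch: identifiable UA + robots.txt check + a crude delay.
import { request } from "undici";

const UA = "example-crawler/1.0 (+https://example.com/bot-info)"; // placeholder, identifiable UA

async function allowedByRobots(url: URL): Promise<boolean> {
  const res = await request(new URL("/robots.txt", url.origin), {
    headers: { "user-agent": UA },
  });
  if (res.statusCode !== 200) return true; // no robots.txt published
  const body = await res.body.text();
  // Only honors "Disallow:" lines in the "User-agent: *" group.
  let inStarGroup = false;
  for (const raw of body.split("\n")) {
    const line = raw.split("#")[0].trim().toLowerCase();
    if (line.startsWith("user-agent:")) {
      inStarGroup = line.slice("user-agent:".length).trim() === "*";
    } else if (inStarGroup && line.startsWith("disallow:")) {
      const path = line.slice("disallow:".length).trim();
      if (path && url.pathname.startsWith(path)) return false;
    }
  }
  return true;
}

async function politeFetch(target: string): Promise<string | null> {
  const url = new URL(target);
  if (!(await allowedByRobots(url))) return null;  // site opted out
  await new Promise((r) => setTimeout(r, 2000));   // crude rate limit
  const res = await request(url, { headers: { "user-agent": UA } });
  return res.body.text();
}
```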

> Is the actual IP range even within the large Google IP range given the source of the proxies as mentioned elsewhere in this thread?

Good question! I am not able to test this, because they don't expose the proxies without paying them money, which I do not intend to do. My guess would be "no".

> still much to discuss in the broader context.

Yeah. The broader context is that most LLM scrapers are a plague, and I cannot wait for this bubble to pop. Until recently, we were getting upwards of 20k requests per day from individual LLM scrapers on a gitlab instance that I admin. The combination of malice and staggering incompetence with which these are operated is incredible. I have observed fun tactics like "switch to a generic UA and increase the crawling rate after being added to robots.txt" from the same scraper that isn't smart enough to realize that it doesn't need to crawl the same commit hash multiple times per hour. The bit that tells you not to get stuck crawling the CI pipeline results forever is there to protect you, silly.

Things are reportedly much worse[2] for admins of larger services. I saw this referred to as a "DDOS of the entire internet" a while ago, which is pretty accurate.

What we ended up doing is setting up an infinite maze of markov chain nonsense text that we serve to LLM scrapers at a few bytes per second. All they have to do to avoid it is respect robots.txt. I recommend this! It's fun and effective, and if we're lucky, it might cause harm to some people and systems that deserve it.
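The shape of it, for anyone curious, is roughly this (a toy sketch: the word list stands in for a real markov generator, and the path, port, and timings are placeholders; your robots.txt should Disallow the maze path so well-behaved crawlers never see it):

```typescript
// Toy sketch: anything crawling under /maze/ gets endless generated text,
// drip-fed slowly, with links that lead deeper into the maze.
import { createServer } from "node:http";

const WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]; // stand-in for a markov model
const babble = () =>
  Array.from({ length: 8 }, () => WORDS[Math.floor(Math.random() * WORDS.length)]).join(" ");

createServer((req, res) => {
  if (!req.url?.startsWith("/maze/")) {
    res.writeHead(404).end();
    return;
  }
  res.writeHead(200, { "content-type": "text/html" });
  const drip = setInterval(() => {
    // One nonsense sentence and one link deeper, a few bytes at a time.
    const next = Math.random().toString(36).slice(2);
    res.write(`<p>${babble()} <a href="/maze/${next}">more</a></p>\n`);
  }, 3000);
  req.on("close", () => clearInterval(drip));
}).listen(8080);
```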

[1]: https://web.archive.org/web/20250322072210/https://docs.hype... [2]: https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/


> The broader context is that most LLM scrapers are a plague, and I cannot wait for this bubble to pop.

My thing is, there are legitimate uses for automated browsing, uses that could be extremely useful and yet nondamaging to (or even supported by!) site operators. But we'll never get to have them if the tools/methods to implement them are the same as the ones used by people inadvertently DDOSing the site they're trying to inhale the entire contents of. For them to not get lumped together, purveyors of the tools cannot remain "neutral" or hide implementation details or condone, whether explicit or implicit, bad behavior. We've seen this happen on the web before, and we're already seeing desperate organizations implement nuclear-option LLM scraper blockers that also take out things like RSS readers. Anyways... I may just have to write something about this...

> So, amusingly, they seem to have added a "ethical scraping" page[1] to their docs in between me looking at this a few hours ago and now.

Congrats, you made an impact :)


Would you like to explain how directing your user agent to use the internet just as you would in order to complete a task or solve a problem is "mass scraping"?


> to use the internet just as you would

Your premise is flawed, or at the very least, far from guaranteed. One could use tools like this to browse the internet only as they would otherwise. One could also use them for mass scraping, and many do. If you've looked at HN on previous days this week, there's been a front-page story nearly every day about problems resulting from exactly that.

The parent comment was perhaps a bit too snarky with the assumption that this could/would only be used for malicious behavior on a large scale, but your assumption that it wouldn't be is not any better. It also runs contrary to what people have been experiencing and discussing in this arena in recent days.


Intent.


Care to elaborate? That is not a substantial argument.


No, we don't enforce any robots.txt restrictions ourselves. We also don't do any scraping ourselves. We provide browser infrastructure that operates like any normal browser would - what users choose to do with it is up to them. We're building tools that give AI agents the same web access capabilities that humans have, and we don't think it's our place to impose any additional limitations.


It is 100% your responsibility what your servers do to other peoples servers in this context, and wanton negligence is not an excuse that will stop your servers from being evicted by hosting companies.


"You make the tools; what people do with them isn't up to you." I can tolerate some form of that opinion/argument on some level, but it is at the very least short-sighted on your part not to have been better equipped to respond to the concerns people have about potential misuse.

If what has been said elsewhere in this thread is true about providing documentation on how to circumvent attempts to detect/block your service and your resistance to providing helpful information such as IP ranges used and how user agents are set, then you have strayed far from being neutral and hands-off.

"it's not our place" is not actual neutrality, it's performative or complicit neutrality. Actual neutrality would be perhaps not providing ways to counter your service, but also not documenting how to circumvent people from trying. And if this is what your POV is, fine! You are far from alone--given the state of the automated browsing/scraping ecosystem right now, plenty of people feel this way. Be honest about it! Don't deflect questions. Don't give misleading answers/information. That's what carries this into sketchy territory


Do you publish an IP range?


This looks cool.

1) I looked at the pricing. Is search included in the price (you just pay for credits/browser time)?

2) Can you tell me more about the source of your residential proxies? I am new to this space, so don’t know how people source these legitimately.

Thanks!


Thanks!

1) Yep, you just pay for browser time and proxy usage

2) We use a handful of proxy providers under the hood ourselves. There are a lot of shady ones, but we only work with providers whose sources we've vetted. Different providers source proxies in different ways - directly from ISPs, paying the end sources for proxies, etc.


What is your vetting process and who have you vetted?


Fantastic.

MCPs are showing promise.


Thanks!

And yeah MCP is super promising. We announced this on X and LinkedIn yesterday and the response has been really good. A lot of people with a bunch of use cases.

One surprising thing is there’s also a bunch of semi/non-technical people using our MCP server and the installation experience for them rn just absolutely sucks.

I think once auth and 1-click install are solved, MCP could become the standard way to integrate tools with LLMs


The MCP protocol is really not that good. The stateful nature of it makes it only suitable as a local desktop RPC pipe - certainly not something that will work well on mobile, nor anything anyone would want to run and maintain in a server-to-server context.

It is fine if that is the scope. It is also understandable why Anthropic chose a stateful protocol where stateless HTTP would be more than enough: they are catering to the default transport layer, which is stdio-based and where state needs to be established.

There are also other aspects of it that are simply unnecessarily complex and resource-intensive for no good reason.
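To make the statefulness concrete: before a client can call a single tool, it has to run an initialize handshake and keep that session open for everything that follows - roughly this sequence over stdio (method names as I remember them from the spec; names and values are placeholders):

```
--> {"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"my-client","version":"0.1.0"}}}
<-- {"jsonrpc":"2.0","id":1,"result":{"protocolVersion":"2024-11-05","capabilities":{"tools":{}},"serverInfo":{"name":"some-server","version":"0.1.0"}}}
--> {"jsonrpc":"2.0","method":"notifications/initialized"}
--> {"jsonrpc":"2.0","id":2,"method":"tools/list"}
--> {"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"scrape_webpage","arguments":{"url":"https://example.com"}}}
```

Drop the connection and you start over; compare that with a stateless HTTP call where every request carries everything it needs.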


What do you suggest instead of MCP?


Well, OpenAPI. You don't need some weird debugging tools nobody knows how to use, a stateful protocol that is hard to troubleshoot, etc. There is plenty of support already built into standard HTTP services and Swagger - an abundance of tools and documentation too - and what we call function calling is basically JSON Schema, which is at the core of Swagger definitions.

MCP is trying to reinvent OpenAPI but in the wrong way.
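To make the overlap concrete, here is a rough sketch of an OpenAPI operation (not any particular vendor's schema; the path and fields are illustrative). The inner `schema` object is exactly the kind of JSON Schema you would otherwise hand a model as a tool's parameters:

```json
{
  "paths": {
    "/scrape": {
      "post": {
        "operationId": "scrape_webpage",
        "summary": "Extract content from a webpage",
        "requestBody": {
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "properties": { "url": { "type": "string" } },
                "required": ["url"]
              }
            }
          }
        }
      }
    }
  }
}
```

A thin adapter over an existing OpenAPI document could expose every documented operation as a tool without inventing a new protocol.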


Doesn't seem like a good equivalent.


Why not? It is easy and widely available. It has support from many development tools, from browsers to REST clients, intercepting proxies, and more.

I guess it is not exciting because nobody can put their name on it.


++ love that folks are trying to build companies on MCP. Good luck!


Thanks! :)


dope



