The same author also makes a Python binding of this which exposes a requests-like API in Python, very helpful for making HTTP reqs without the overhead of running an entire browser stack: https://github.com/lexiforest/curl_cffi
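For those who haven't tried it, the binding really does look like requests; roughly (the exact impersonate target strings, e.g. "chrome" or "chrome124", vary by the version you install):

    # pip install curl_cffi
    from curl_cffi import requests

    # "impersonate" selects a browser TLS/HTTP2 fingerprint; available target
    # strings depend on the installed version of curl_cffi
    r = requests.get("https://example.com", impersonate="chrome")
    print(r.status_code)
    print(r.text[:200])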
I can't help but feel like these are the dying breaths of the open Internet though. All the megacorps (Google, Microsoft, Apple, CloudFlare, et al) are doing their damndest to make sure everyone is only using software approved by them, and to ensure that they can identify you. From multiple angles too (security, bots, DDoS, etc.), and it's not just limited to browsers either.
End goal seems to be: prove your identity to the megacorps so they can track everything you do and also ensure you are only doing things they approve of. I think the security arguments are just convenient rationalizations in service of this goal.
> I can't help but feel like these are the dying breaths of the open Internet though
I agree about the overzealous tracking by the megacorps, but some of this is also driven by bad actors. I work for a financial company, and the amount of API abuse, ATO, DDoS, nefarious bot traffic, etc. we see on a daily basis is absolutely insane.
But how much of this "bad actor" interaction is actually countered by tracking? And how many of these attempts come even close to succeeding once even the simplest out-of-the-box security practices are set up?
And when it does get more dangerous, is overzealous tracking the best counter for it?
I've dealt with a lot of these threats as well, and a lot are countered with rather common tools, from simple fail2ban rules to application firewalls and private subnets and whatnot. E.g. a large fail2ban rule that just bans anything attempting to HTTP GET /admin.php or /phpmyadmin etc., even just once, gets rid of almost all nefarious bot traffic.
So, yes, the number of attacks can indeed be insane. But the share that needs overzealous tracking to be countered is, AFAICS, rather small.
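For a concrete picture, here is a toy Python sketch of that "ban on first probe" heuristic (not a real fail2ban filter: the log format, path list, and ban action are placeholders; in practice you would express the paths as a fail2ban failregex and let it handle the banning):

    # Toy sketch of the "ban on first probe" idea. The log path, probe list and
    # ban command are illustrative placeholders, not a drop-in fail2ban config.
    import re
    import subprocess

    PROBE_PATHS = ("/admin.php", "/phpmyadmin", "/wp-login.php", "/.env", "/.git")
    # common/combined log format: IP ... [date] "METHOD /path ..."
    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)')

    def offending_ips(logfile="/var/log/nginx/access.log"):
        banned = set()
        with open(logfile) as f:
            for line in f:
                m = LOG_LINE.match(line)
                if m and m.group(2).startswith(PROBE_PATHS):
                    banned.add(m.group(1))
        return banned

    for ip in offending_ips():
        # one possible ban action; fail2ban's iptables/nftables actions do this properly
        subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"])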
I can tell you about my experience with blocking traffic from scalper bots that were very active during the pandemic.
All requests produced by those bots were valid ones, nothing that could be flagged by tools like fail2ban etc (my assumption is that it would be the same for financial systems).
Any blocking or rate limiting by IP was useless: we saw about 2-3 requests per minute per IP, and those actors had access to a ridiculous number of large CIDRs, so blocking any IP just caused it to be instantly replaced with another.
Blocking by AS number was also a mixed bag, as that list grew really quickly, and most of them were registered to suspicious-looking Gmail addresses. (I suspect such activity might own a significant percentage of the total IPv4 space.)
This was basically a cat-and-mouse game of finding some specific characteristic in the requests that matched all that traffic and filtering on it, but the other side would adapt the next day or over a weekend.
The aggregate traffic was in the range of 2-20k r/s to basically the heaviest endpoint in the shop, which was the main reason we needed to block it (it generated 20-40x the load of organic traffic).
Cloudflare was also not really successful with the default configuration; we basically had to challenge everyone by default, with a whitelist of the most common regions we expected customers from.
So the best solution is to track everyone and calculate long-term reputation.
TBC: I wasn't saying that F2B is a silver bullet. Not at all.
But that protection depends on the use case. And in many of my use cases, a simple f2b setup with a large hardcoded list of URL paths I guarantee to never have will drop bot traffic by 90% or more. The last 10% then splits into "hits because the IP is new" and "other, more sophisticated bots". Bots, in those cases, are mostly just stupid worms, trying out known WP exploits, default passwords on commonly used tools (Nextcloud, phpMyAdmin, etc.) and so on.
I've done something similar with a large list of known harvester/scraper bots, based on their user-agent (the nice ones) or their movements. Nothing complex, just things like a /hidden-page.html that's linked but hidden with CSS/JS.
And with spam bots, where certain POST requests can only come from repeatedly submitting the contact form.
This obviously isn't going to give any protection against targeted attacks. Nor will it protect against more sophisticated bots. But in some (in my case, most) use cases, it's enough to drop bot traffic significantly.
I've learned that Akamai has a service that deals with this specific problem, maybe this might interest you as well: https://www.akamai.com/products/content-protector
Blocking scalper bot traffic by any means, be it by source or certified identification seems a lost cause, i.e. not possible because it can always be circumvented. Why did you not have that filter at point of sale instead? I'm sure there are reasons, but to have a battery of captchas and a limit on purchases per credit card seems on the surface much more sturdy. And it doesn't require that everyone browsing the internet announce their full name and residential address in order to satisfy the requirements of a social score ...
The product they tried to buy was not in stock anyway, but their strategy was to keep trying constantly, so that if it came back in stock they would be the first to get it. It was all guest checkout, so no address to validate yet, nor a credit card. Because they used API endpoints used by the frontend, we could not use any captcha at this place because of technical requirements.
As stated before, the main reason we needed to block it was the volume of the traffic; you might imagine an identical scenario when dealing with a DDoS attack.
Disabling guest checkout would have been my weapon of choice, or at least requiring the user to enter an email address so that they are notified when the product becomes available.
> Because they used API endpoints used by the frontend we could not use any captcha at this place because of technical requirements
A time-sensitive hash validating each request makes it a bit harder for them without significant extra work on your part. An address-sensitive one is much more effective but can cause issues for users who switch between networks (using your site on the move and passing between networks, for instance).
> Because they used API endpoints used by the frontend we could not use any captcha at this place because of technical requirements.
That doesn't compute... Captcha is almost always used in such setups.
It also looks like you could just offer an API endpoint which would return if the article is in stock or not, or even provide a webhook. Why fight them? Just make the resource usage lighter.
I'm curious now though what the articles were, if you are at liberty to share?
We had a captcha, but it was at a later stage of the checkout process. This API endpoint needed to work from cached pages, so the request could not contain any dynamic state.
Some bots checked the product page, where we showed whether the product was in stock (although they tried heavily to bypass any caches by putting garbage in the URL). These bots also scaled instantly to thousands of checkout requests when the product became available, which gave no time for auto scaling to react (that was another challenge here).
This was easy to mitigate so it didn't generate almost any load on the system.
I believe we had an email notification available, but it was probably too high-latency a channel for them.
I'm not sure how much I can share about articles here, but I can say that those were fairly expensive (and limited series) wardrobe products.
Hm, it's probably too late, but you could have implemented some kind of proof of work in your API calls. Something that's not too onerous for a casual user but is costly for someone firing off many requests.
This was actually one of my ideas for how to solve it; the observed behaviour strongly suggested that all those thousands of IP addresses were used by a single server. Even a small PoW at that volume should heavily dent their capacity. But we decided we did not want to affect performance for mobile users. We later learned that such a strategy is also used by Cloudflare's JS check.
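For anyone curious what that looks like, here is a minimal hash-based proof-of-work sketch of the kind being discussed (illustrative only, not what Cloudflare ships; the difficulty constant is made up and would need tuning, and real deployments solve the puzzle in client-side JS/WASM):

    # The server hands out a challenge; the client must find a nonce whose
    # SHA-256 hash has N leading zero bits before its request is accepted.
    # Cheap for one shopper, expensive at thousands of requests per second.
    import hashlib
    import itertools
    import os

    DIFFICULTY_BITS = 20  # illustrative; tune so a legit client solves it quickly

    def new_challenge() -> str:
        return os.urandom(16).hex()

    def leading_zero_bits(digest: bytes) -> int:
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
                continue
            bits += 8 - byte.bit_length()
            break
        return bits

    def solve(challenge: str) -> int:
        # client side: brute-force a nonce that meets the difficulty target
        for nonce in itertools.count():
            d = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
            if leading_zero_bits(d) >= DIFFICULTY_BITS:
                return nonce

    def verify(challenge: str, nonce: int) -> bool:
        # server side: a single hash to check the client's work
        d = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return leading_zero_bits(d) >= DIFFICULTY_BITS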
> Why fight them? Just make the resource usage lighter.
Because you presumably want real, returning customers, and that means those customers need to get a chance at buying those products, instead of them being scooped up by a scalper the millisecond they appear on the website.
Maybe offer the item for preorder?
Sounds like a dream, having customers scooping up your products the millisecond they appear on the website. They should increase their prices.
No matter what the price, they would still have the "As stated before the main reason we needed to block it was volume of the traffic" problem that was stated above, for a popular item. In fact increasing the base price might attract even more scalpers and such.
I remember people doing this with PS5 when they were in short supply after release.
The best solution is to put everyone in a little cage and keep a permanent record of everything they do. That doesn't mean it's a desirable solution.
Require a verified account to buy high demand items.
why not charge people? This is the only solution I can think of.
> E.g. a large fail2ban rule that just bans anything attempting to HTTP GET /admin.php or /phpmyadmin etc., even just once, gets rid of almost all nefarious bot traffic.
Unfortunately, fail2ban wouldn't even make a dent in the attack traffic hitting the endpoints in my day-to-day work; these are attackers utilizing residential proxy infrastructure who are increasingly capable of solving JS/client-puzzle challenges. The arms race is always escalating.
We see the same thing, also at a financial company. The most successful strategy we've found is making stuff like this extremely expensive for whoever it is when we spot it; they stop, or slow down to the point where it's not worth it, and move on. Sometimes that's really all you can do without harming legit traffic.
Such a rule is a great way to let malicious users lock out a bunch of your legitimate customers. Imagine if someone makes a forum post and includes this in it:
[img]https://example.com/phpmyadmin/whatever.png[/img]
That would be in the body of the request. OP is talking about URLs in the actual request, which is part of the header.
While I don't have experience with a great number of WAFs I'm sure sophisticated ones let you be quite specific on where you are matching text to identify bad requests.
As an aside, another "easy win" is assuming any incoming HTTP request for a dotfile is malicious. I see constant unsolicited attempts to access `.env`, for example.
A lot of modern standards rely on .well-known urls to convey abilities, endpoints, related services and so on.
In my case, I never run anything PHP, so I just plain block anything PHP (same for Python, Lua, Active Directory, etc.). And, indeed, .htaccess, .env, etc. A rather large list of hardcoded stuff that gets an instant ban. It drops the bot traffic by 90% or more.
These obviously aren't targeted attacks. Protecting against those is another issue altogether.
When legitimate users viewed that forum post, their browsers would, in the course of loading the image, attempt to HTTP GET /phpmyadmin/whatever.png, with that being the URL in the actual request in the header.
That's not the same type of botnet. fail2ban simply is not going to work when you have a popular unauthenticated endpoint. You have hundreds of thousands of rps spread across thousands of legitimate networks, and the requests are constantly modified to look legitimate in a never-ending game of whack-a-mole.
You wind up having to use things like TLS fingerprinting with other heuristics to identify what traffic to reject. These all take engineering hours and require infrastructure. It is SO MUCH SIMPLER to require auth and reject everything else outright.
I know that the BigCos want to track us, and you originally mentioned tracking, not auth. But my point is: yeah, they have malicious reasons for locking things down, but there are legitimate reasons too.
Easy solution to rate limit: require an initial request to get a one-time token (with a 1-second delay), and then require valid requests to include that token. The returned token is a salted hash of something like the timestamp and IP. That way they can only bombard the token generator.
GET /token
    returns a token with a timestamp in a salted hash
GET /resource?token=abc123xyz
    check for a valid token, and drop or deny otherwise
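A sketch of roughly that scheme in Python (the secret and expiry are made up; using an HMAC over timestamp + client IP keeps it stateless, and binding to the IP is the "address sensitive" variant mentioned upthread):

    # Stateless token: HMAC over timestamp + client IP, so nothing is stored
    # server-side. Secret and MAX_AGE are placeholder values.
    import hashlib
    import hmac
    import time

    SECRET = b"rotate-me-regularly"
    MAX_AGE = 60  # seconds a token stays valid

    def issue_token(client_ip: str) -> str:
        ts = str(int(time.time()))
        sig = hmac.new(SECRET, f"{ts}|{client_ip}".encode(), hashlib.sha256).hexdigest()
        return f"{ts}.{sig}"

    def check_token(token: str, client_ip: str) -> bool:
        try:
            ts, sig = token.split(".", 1)
            age = time.time() - int(ts)
        except ValueError:
            return False
        if age > MAX_AGE:
            return False  # expired: drop or deny
        expected = hmac.new(SECRET, f"{ts}|{client_ip}".encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(sig, expected)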
As at least one person working on this has pointed out in this thread: their adversaries have IP blocks and ASNs.
> You wind up having to use things like tls fingerprinting
...and we've circled back to the post's subject: a version of curl that impersonates browsers' TLS handshake behavior to bypass such fingerprinting.
This depends on what you're fighting.
If you're fighting adversaries that go for scale, AKA trying to hack as many targets as possible, mostly low-sophistication, using techniques requiring 0 human work and seeing what sticks, yes, blocking those simple techniques works.
Those attackers don't ever expect to hack Facebook or your bank, that's just not the business they're in. They're fine with posting unsavory ads on your local church's website, blackmailing a school principal with the explicit pictures he stores on the school server, or encrypting all the data on that server and demanding a ransom.
If your company does something that is specifically valuable to someone, and there are people whose literal job it is to attack your company's specific systems, no, those simple techniques won't be enough.
If you're protecting a Church with 150 members, the simple techniques are probably fine, if you're working for a major bank or a retailer that sells gaming consoles or concert tickets, they're laughably inadequate.
The question is a bit of a non sequitur, since this is not tracking. The TLS fingerprint is not a useful tracking vector, by itself nor as part of some composite fingerprint.
The point is that you have to use an approved client (eg browser, os) with an approved cert authority that goes through approved gatekeepers (eg Cloudflare, Akamai)
That seems pretty unlikely to be the original point of https://news.ycombinator.com/item?id=42549415, which mentions none of that, and doesn't even have directionally the same concerns.
But also, what you wrote is basically nonsense. Clients don't need "an approved cert authority". Nor are there any "approved gatekeepers", all major browsers are equally happy connecting to your Raspberry Pi as they are connecting to Cloudflare.
A big problem is that, where we have a good solution, you'll lose if you insist on that solution while other people get away with doing something that's crap but that customers like better. We often have to mandate a poor solution that will be tolerated, because if we mandate the better solution it will be rejected, and if we don't mandate anything the outcomes are far worse.
Today for example I changed energy company†. I made a telephone call, from a number the company has never seen before. I told them my name (truthfully but I could have lied) and address (likewise). I agreed to about five minutes of parameters, conditions, etc. and I made one actual meaningful choice (a specific tariff, they offer two). I then provided 12 digits identifying a bank account (they will eventually check this account exists and ask it to pay them money, which by default will just work) and I'm done.
Notice that anybody could call from a burner and that would work too. They could move Aunt Sarah's energy to some random outfit, assign payments to Jim's bank account, and cause maybe an hour of stress and confusion for both Sarah and Jim when months or years later they realise the problem.
We know how to do this properly, but it would be high friction and that's not in the interests of either the "energy companies" or the politicians who created this needlessly complicated "Free Market" for energy. We could abolish that Free Market, but again that's not in their interests. So, we're stuck with this waste of our time and money, indefinitely.
There have been simpler versions of this system, which had even worse outcomes. They're clumsier to use, they cause more people to get scammed AND they result in higher cost to consumers, so that's not great. And there are better systems we can't deploy because in practice too few consumers will use them, so you'd have 0% failure but lower total engagement and that's what matters.
† They don't actually supply either gas or electricity, that's a last mile problem solved by a regulated monopoly, nor do they make electricity or drill for gas - but they do bill me for the gas and electricity I use - they're an artefact of Capitalism.
Much of this "bad actor" activity is actually customer needs left hanging - for either the customer to automate herself or other companies to fill the gap to create value that's not envisioned by the original company.
I'm guessing investors actually like a healthy dose of open access and a healthy dose of defence. We see them (YC, as an example) betting on multiple teams addressing the same problem. The difference is their execution, the angle they attack.
If, say, the financial company you work for is capable on both the product and the technical side, I assume it leaves no gap. It's the main place to access the service and all the side benefits.
> Much of this "bad actor" activity is actually customer needs left hanging - for either the customer to automate herself or other companies to fill the gap to create value
Sometimes the customer you have isn't the customer you want.
As a bank, you don't want the customers that will try to log in to 1000 accounts, and then immediately transfer any money they find to the Seychelles. As a ticketing platform, you don't want the customers that buy tickets and then immediately sell them on for 4x the price. As a messaging app, you don't want the customers who have 2000 bot accounts and use AI to send hundreds of thousands of spam messages a day. As a social network, you don't want the customers who want to use your platform to spread pro-russian misinformation.
In a sense, those are "customer needs left hanging", but neither you nor other customers want those needs to be automatable.
Yep totally agree these are problems. I don't have a good alternative proposal either, I'm just disappointed with what we're converging on.
> ensure you are only doing things they approve of
Absolutely. They might not care about individuals, though. It's their approach to shaping "markets". The Apple, Google, Amazon, and Microsoft tax is not inevitable, and that's their problem. They will fight tooth and nail to keep you locked in, call it "innovation", and even cooperate with governments (which otherwise are their natural enemy in the fight for digital control). It's the people that a) don't care much and b) don't have any options.
In the end, a large share of our wealth is just pulled from us to these ever more ridiculous rent seeking schemes.
You are on point. There is no open internet without computing freedom.
Computers used to be empowering. Cryptography used to be empowering. Then these corporations started using both against us. They own the computers now. Hardware cryptography ensures the computers only run their software now, software that does the corporation's bidding and enforces their controls. And if we somehow gain control of the computer, we are denied every service and essentially ostracized. I don't think it will be long before we are banned from the internet proper for using "unauthorized" devices.
It's an incredibly depressing state of affairs. Everything the word "hacker" ever stood for is pretty much dying. It feels like there's no way out.
Even if the internet was wide open it’s of little use these days.
AI will replace any search you would want to do to find information, the only reason to scour the internet now is for social purposes: finding comments and forums or content from other users, and you don’t really need to be untracked to do all that.
A megacorp’s main motivation for tracking your identity is to sell you shit or sell your data to other people who want to sell you things. But if you’re using AI the amount of ads and SEO spam that you have to sift through will dramatically reduce, rendering most of those efforts pointless.
And most people aren’t using the internet like in the old days: stumbling across quaint cozy boutique websites made by hobbyists about some favorite topic. People just jump on social platforms and consume content until satisfied.
There is no money to be made anymore in mass web scraping at scale with impersonated clients, it’s all been consumed.
"I have nothing to hide" will eventually spread to everyone. Very unfortunate.
I'm in a similar boat but it's more like "I have nothing I can hide".
These days I just tell friends & family to assume that nothing they do is private.
The answer is simple: I have something to hide. I have many things to hide, actually. None of these things is currently illegal, but I still have many things to hide. And if I have something to hide, I can be worried about many things.
"It's not that I have something to hide, there's simply nothing I want to show you."
A lot of the motivation comes from government regulations too. Right now this is mostly in banking, but social media and porn regs are coming too.
PornHub and all of its affiliate sites now block all residents of Alabama, Arkansas, Idaho, Indiana, Kansas, Kentucky, Mississippi, Montana, Nebraska, North Carolina, Texas, Utah, and Virginia (and Florida on Jan 1): https://www.pcmag.com/news/pornhub-blocked-florida-alabama-t...
Child safety, as always, was the sugar that made the medicine go down in freedom-loving USA. I imagine these states' approaches will try to move to the federal level after Section 230 dies an ignominious death.
Keep an eye out for Free Speech Coalition v. Paxton to hit SCOTUS in January: https://www.oyez.org/cases/2024/23-1122
> … helpful for making HTTP reqs without the overhead of running an entire browser stack
For those less informed, add “to impersonate the fingerprints of a browser.”
One can, obviously, make requests without a browser stack.
The disappearance of the third space is killing us.
They've been planning this stuff for a long time...
https://en.wikipedia.org/wiki/Next-Generation_Secure_Computi...
...and we're seeing the puzzle pieces fall into place. Mandated driver signing, TPMs, and more recently remote attestation. "Security" has always been the excuse --- securing their control over you.
Another trending thread right now is Pegasus/Predator; as much as it may be a facade, to say MS (or any OS vendor) has no business working on security/secure computing is demonstrably false.
I have no problem with them working on security in the service of the user. The problem is with them claiming to do that, but instead doing the opposite.
What are some example sites where this is both necessary and sufficient? In my experience sites with serious anti-bot protection basically always have JavaScript-based browser detection, and some are capable of defeating puppeteer-extra-plugin-stealth even in headful mode. I doubt sites without serious anti-bot detection will do TLS fingerprinting. I guess it is useful for the narrower use case of getting a short-lived token/cookie with a headless browser on a heavily defended site, then performing requests using said tokens with this lightweight client for a while?
A lot of WAFs make it a simple thing to set up. Since it doesn't require any application-level changes, it's an easy "first move" in the anti-bot arms race.
At the time I wrote this up, r1-api.rabbit.tech required TLS client fingerprints to match an expected value, and not much else: https://gist.github.com/DavidBuchanan314/aafce6ba7fc49b19206...
(I haven't paid attention to what they've done since so it might no longer be the case)
Makes sense, thanks.
There are sites that will block curl and python-requests completely, but will allow curl-impersonate. IIRC, Amazon is an example that has some bot protection but it isn't "serious".
In most cases this is just based on the user agent. It's widespread enough that I just habitually tell requests not to set a User-Agent at all (requests without one aren't blocked, but if the UA contains "python" they are).
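With python-requests you can strip the header by setting it to None on the session (a small sketch; note that newer urllib3 releases underneath may still substitute their own default UA):

    # python-requests sends "User-Agent: python-requests/x.y.z" by default.
    # Setting the header to None removes it from the headers requests merges in.
    import requests

    s = requests.Session()
    s.headers["User-Agent"] = None

    r = s.get("https://httpbin.org/headers")
    print(r.json()["headers"])  # echoes back what the server actually received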
Lots of sites, actually.
> I doubt sites without serious anti-bot detection will do TLS fingerprinting
They don't set it up themselves. Cloudflare offers such a thing by default (?).
Pretty sure it’s not default, and Cloudflare browser check and/or captcha is a way bigger problem than TLS fingerprinting, at least was the case the last time I scraped a site behind Cloudflare.
CloudFlare offers it. Even if it's not used for blocking it might be used for analytics or threat calculations, so you might get hit later.
Those JavaScript scripts often get data from some API, and it's that API that will usually be behind some fingerprinting wall.
The build scripts in this repo seem a bit cursed. It uses autotools but has you build in a subdirectory. The default build target is a help text instead of just building the project. When you do use the listed build target, it doesn't have the dependencies set up correctly, so you have to run it like 6 times to get to the point where it is building the application.
Ultimately I was not able to get it to build, because the BoringSSL distro it downloaded failed to build even though I made sure all of the dependencies INSTALL.md listed are installed. This might be because the machine I was trying to build it on is an older Ubuntu 20 release.
Edit: Tried it on Ubuntu 22, but BoringSSL again failed to build. The make script did work better, however, only requiring a single invocation of make chrome-build before blowing up.
Looks like a classic case of "don't ship -Werror because compiler warnings are unpredictable".
Died on:
/extensions.cc:3416:16: error: ‘ext_index’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
The good news is that removing -Werror from the CMakeLists.txt in BoringSSL got around that issue. Bad news is that the dependency list is incomplete. You will also need libc++-XX-dev and libc++abi-XX-dev where the XX is the major version number of GCC on your machine. Once you fix that it will successfully build, but the install process is slightly incomplete. It doesn't run ldconfig for you, you have to do it yourself.
On a final note, despite the name, BoringSSL is a huge library that takes a surprisingly long time to build. I thought it would be like LibreSSL, where they trim it down to the core to keep the attack surface small, but apparently Google went in the opposite direction.
Hi, maintainer here, the whole project is a hack, actually :P
The original repo was already full of hacks, and on top of that, I added more hacks to keep up with the latest browsers. The main purpose of my fork is to serve as the foundation for the Python binding, which I think is easier to use. So I haven't tried to make the whole process more streamlined as long as it works on the CI. You can use the prebuilt binaries on the release page, though. I guess I should find some time to clean up the whole thing.
Look on the bright side: the harder it is to build and use correctly, the harder it is for the enemy to analyse and react.
I hit that too, there is an open bug: https://github.com/lexiforest/curl-impersonate/issues/81
Worked around it by modifying the patch: https://github.com/jakeogh/jakeogh/blob/master/net-misc/curl...
Considering the complexity, this project, and its upstream parent and grandparent (curl proper), are downright amazing.
Played this game and switched to prebuilt libraries. Think builder docker images have also been broken for a while.
That's exactly why I stopped using C/C++. Building is often a nightmare, and the language teams seem to have no interest in improving the situation.
In case anyone is interested, I created something similar but for Python (using Chromium's network stack): https://github.com/lagenar/python-cronet. I'm looking for help to create the build for Windows.
Similar projects exist for C# (https://github.com/sleeyax/CronetSharp), Go (https://github.com/sleeyax/cronet-go) and Rust (https://github.com/sleeyax/cronet-rs).
These can work well in some cases but it's always a tradeoff.
Any reason you didn’t use https://github.com/lexiforest/curl_cffi?
I wanted to try a different approach, which is to use Chromium's network stack directly instead of patching curl to impersonate it. In this case you're using the real thing, so it's a bit easier to maintain when there are changes in the fingerprint.
Any plan to offer a sync API?
Thankfully only a small fraction of websites do JA3/JA4 fingerprinting. Some do more advanced stuff like correlating headers with the fingerprint. We have been able to get away without doing much in Caido for a long time, but I am working on an OSS Rust-based equivalent. Neat trick: you can use the fingerprint of our competitor (Burp Suite), since it is whitelisted so the security folks can do their job. The only time you will not hear me complain about checkbox security.
I recently used ja3proxy, which uses utls for the impersonation. It exposes an HTTP proxy that you can use with any regular HTTP client (unmodified curl, python, etc.) and wraps it in a TLS client fingerprint of your choice. Although I don't think it does anything special for http/2, which curl-impersonate does advertise support for.
https://github.com/bogdanfinn/tls-client is the go-to package for the go world, it does the same thing
What is the use case? If you have to read data from one specific website which uses handshake info to avoid being read by software?
When I have to do HTTP requests these days, I default to a headless browser right away, because that seems to be the best bet. Even then, some websites are not readable because they use captchas and whatnot.
> I default to a headless browser
Headless browsers consume orders of magnitude more resources, and execute far more requests (e.g. fetching images) than a common webscraping job would require. Having run webscraping at scale myself, the cost of operating headless browsers made us only use them as a last resort.
Blocking all image/video/CSS requests is the rule of thumb when working with headless browsers via CDP
Speaking as a person who has played on both offense and defense: this is a heuristic that's not used frequently enough by defenders. Clients that load a single HTML/JSON endpoint without loading css or image resources associated with the endpoints are likely bots (or user agents with a fully loaded cache, but defenders control what gets cached by legit clients and how). Bot data thriftiness is a huge signal.
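A very rough sketch of that signal (the client key, asset extensions, thresholds, and in-memory storage are all illustrative; a real deployment would live in the WAF/edge layer and account for caching and CDN-served assets):

    # Rough sketch of the "data thriftiness" signal: clients that fetch HTML/JSON
    # endpoints but never the CSS/JS/images those pages reference look bot-like.
    import time
    from collections import defaultdict

    WINDOW = 300  # seconds of history to keep per client

    class SubresourceSignal:
        def __init__(self):
            self.pages = defaultdict(list)   # client key -> page-load timestamps
            self.assets = defaultdict(list)  # client key -> asset-load timestamps

        def record(self, client_key: str, path: str) -> None:
            now = time.time()
            is_asset = path.endswith((".css", ".js", ".png", ".jpg", ".webp"))
            bucket = self.assets if is_asset else self.pages
            bucket[client_key].append(now)
            bucket[client_key] = [t for t in bucket[client_key] if now - t < WINDOW]

        def looks_bot_like(self, client_key: str) -> bool:
            pages = len(self.pages[client_key])
            assets = len(self.assets[client_key])
            # several page loads, zero asset loads: suspicious (unless fully cached)
            return pages >= 5 and assets == 0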
As a high-load systems engineer, you'd want to offload asset serving to a CDN, which makes detection slightly more complicated. The easy way is to attach an image onload handler with client-side JS, but that would yield a lot of false positives. I personally have never seen such an approach and doubt it's useful for many cases.
Unless organization policy forces you to, you do not have to put all resources behind a CDN. As a matter of fact, getting this heuristic to work requires a non-optimal caching strategy for one or more real or decoy resources, CDN or not. "Easy" is not an option in the bot/anti-bot arms race; all the low-hanging fruit is gone when fighting a determined adversary on either end.
> I personally have never seen such approach and doubt its useful for many concerns.
It's an arms race, and defenders are not keen on sharing their secret sauce, though I can't be the only one who thought of this rather basic bot characteristic; multiple abuse teams probably realized this decades ago. It works pretty well against low-resource scrapers with fake UA strings and all the right TLS handshakes. It won't work against headless browsers, which cost scrapers more in resources and bandwidth, and there are specific countermeasures for headless browsers [1], and counter-countermeasures. It's a cat-and-mouse game.
1. E.g. mouse movement, made famous as one signal evaluated by Google's reCAPTCHA v2, monitor resolution & window size and position, and canvas rendering, all of which have been gradually degraded by browser anti-fingerprinting efforts. The bot war is fought on the long tail.
Even legitimate users might want to disable CSS and pictures and whatever, and I often do when I just want to read the document.
Blind users also might have no use for the pictures, and another possibility is that if the document is longer than the screen, so the picture is out of view, the user might program the client software to use lazy loading, etc.
Indeed, that's why it's one heuristic/signal among many others
So you maintain a table of domains and how to access them?
How do you build that table and keep it up to date? Manually?
> What is the use case? If you have to read data from one specific website which uses handshake info to avoid being read by software?
Evade captchas. curl user agent / heuristics are blocked by many sites these days - I'd guess many popular CDNs have pre-defined "block bots" stuff that blocks everything automated that is not a well-known search engine indexer.
>The Client Hello message that most HTTP clients and libraries produce differs drastically from that of a real browser.
Why is this?
Based on what I've seen, most command-line clients and basic HTTP libraries typically ship with leaner, more static configurations (e.g., no GREASE extensions in the Client Hello, limited protocols in the ALPN extension, a smaller set of signature algorithms). Mirroring real browser TLS fingerprints is also more difficult due to the randomization of Client Hello parameters (e.g., in current versions of Chrome).
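You can get a feel for how static a stock client's offer is from Python's ssl module, for instance (this only shows the cipher list; the extension set and ordering that get fingerprinted are just as fixed):

    # A stock Python TLS client offers a fixed cipher list and only the ALPN
    # protocols you explicitly ask for; no GREASE values, no per-connection
    # shuffling of extensions like a current Chrome Client Hello has.
    import ssl

    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(["h2", "http/1.1"])  # many libraries never set ALPN at all

    for cipher in ctx.get_ciphers():
        print(cipher["name"])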
They use different SSL libraries/configuration. Chrome uses BoringSSL and other libraries may use OpenSSL or some other library. Besides that the SSL library may be configured with different cipher suites and extensions. The solution these impersonators provide is to use the same SSL library and configuration as a real browser.
The protocols are flexible and most browsers bring their own HTTP+TLS clients
Interesting in light of another much-discussed story about AI scraper farms swamping/DDOSing sites https://news.ycombinator.com/item?id=42549624
(very rough) ebuild: https://github.com/jakeogh/jakeogh/blob/master/net-misc/curl...
I can't help but think that projects like these shouldn't be posted here, since the enemy is among us. Prodding the bear even more might lead to an acceleration towards the dystopia that others here have already prophesised.
The following browsers can be impersonated.
...unfortunately no Firefox to be seen.
I've had to fight this too, since I use a filtering proxy. User-agent discrimination should be illegal. One may think the EU could have some power to change things, but then again, they're also hugely into the whole "digital identity" thing.
Maintainer here. Curl dropped NSS support about a year ago, which is the SSL engine Firefox uses. Without NSS, two special extensions cannot be added, and that's why only WebKit-based browsers are left.
You can find support for old firefox versions in the original repo.
It says "Firefox(In progress)", and the original project this was forked from has it: https://github.com/lwthiker/curl-impersonate
I think we should list the sites where this fingerprinting is done. I have a suspicion that Microsoft does it for conditional access policies but I am not sure of other services.
We cannot really list them, as 90% of the time, it's not the websites themselves, it's their WAF. And there is a trend toward most company websites to be behind a WAF nowadays to avoid 1) annoying regulations (US companies putting geoloc on their websites to avoid EU cookie regulations) and 2) DDoS.
It's now pretty common to have Cloudflare, AWS, etc. WAFs as the main endpoints, and these do anti-bot checks (TLS fingerprinting, header fingerprinting, JavaScript checks, captchas, etc.).
Cloudflare (which seems to be fronting half the web these days, based off the number of cf-ray headers I see being sent back) does this with bot protection on, and Akamai has something similar I think.
> The resulting curl looks, from a network perspective, identical to a real browser.
How close is it? If I ran wireshark, would the bytes be exactly the same in the exact same packets?
The packets from Chrome wouldn't be exactly the same as packets sent by Chrome at a different time either. "The exact same packets" is not a viable benchmark, since both the client and the server randomize the payloads in various ways. (E.g. key exchange, GREASE).
You can check your fingerprint on https://tls.peet.ws
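For example, with the Python binding mentioned at the top of the thread; the /api/all JSON endpoint path here is an assumption on my part, so check the site if it has moved:

    # Compare what the site reports for a default curl_cffi request vs. an
    # impersonated one. The /api/all JSON path is assumed from memory.
    from curl_cffi import requests

    plain = requests.get("https://tls.peet.ws/api/all")
    faked = requests.get("https://tls.peet.ws/api/all", impersonate="chrome")

    print("default:", plain.text[:300])
    print("chrome :", faked.text[:300])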
What else could "identical" mean?
It could be that the TCP streams are the same, but the packetization is different.
It could mean that the packets are the same, but timing is off by a few milliseconds.
It could mean a single HTTP request exactly matches, but when doing two requests the real browser uses a connection pool but curl doesn't. Or uses HTTP/3's fast-open abilities, etc.
etc.
Two TLS streams are never byte-identical, due to randomness inherent to the protocol.
Identical here means having the same fingerprint - i.e. you could not write a function to reliably distinguish traffic from one or the other implementation (and if you can then that's a bug).
It replicates the browser at the HTTP/SSL level, not TCP. From what I know this is good enough to bypass cloudflare's bot detection.
I like this project!
Is there a way to request impersonation of the current version of Chrome (or whatever)?
The latest version is a moving target; currently you get the following Chrome versions:
$ curl_chrome <TAB><TAB>
curl_chrome100
curl_chrome101
curl_chrome104
curl_chrome107
curl_chrome110
curl_chrome116
curl_chrome119
curl_chrome120
curl_chrome123
curl_chrome124
curl_chrome131
curl_chrome131_android
curl_chrome99
curl_chrome99_android
Perhaps plain `curl_chrome` could use the latest available `curl_chromeNNN`