
I think the big problem here is conceptual. The JDK folks are looking at this as akin to PGO when, IMHO, they should be looking at it as an AOT cache (yes, the flag names make this even more confusing). How do those two differ, you ask?

With PGO you do a lot of deliberate work to profile your application under different conditions and feed that information back to the compiler to make better branch/inlining decisions. With an AOT cache, you do nothing up front, and the JVM should just dump a big cache to disk every time it exits, just in case it gets started again on the same host. In this case, training runs would just be "a run you did to create the cache". With that said, the big technical challenge right now is that building the AOT cache is expensive, hence performance-impacting, and cannot really be done alongside a live application - but that's where I think the focus should be: making filling the AOT cache less intensive and automatic.

Another aspect this strategy would help with is "what to do with these big AOT cache files". If the AOT cache really starts caching every compiled method, it will essentially become another .so file, possibly of a size greater than the original JAR it started off with. Keeping this in a Docker image will double the size of the image, slowing down deployments. Alternatively, with the AOT cache concept, you just need to ensure there is some form of persistent disk cache across your hosts. The same logic also significantly helps CLIs, where I don't want to ship a 100MB CLI + jlink bundle and have to add another 50MB of AOT cache to it - what I do want is for the JVM to keep improving the AOT cache every time the client uses my CLI.


It would be nice to be able to trigger AOT somehow, e.g. as part of a Docker build, or as part of an app startup as you say. Then the software deployment can decide what to do.
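
For illustration, the JEP 483 flags (JDK 24+) already make a two-step flow like this scriptable, e.g. inside a Docker build; app.jar and Main here are hypothetical:

    # Training run: record an AOT configuration while exercising the app
    java -XX:AOTMode=record -XX:AOTConfiguration=app.aotconf -cp app.jar Main
    # Assembly: create the cache from the recorded configuration
    java -XX:AOTMode=create -XX:AOTConfiguration=app.aotconf -XX:AOTCache=app.aot -cp app.jar
    # Production run: start with the cache
    java -XX:AOTCache=app.aot -cp app.jar Main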


Is there any reason to think Java code can’t be statically linked, and then dead code eliminated (for that specific build of the app)?

I’m not asking if the tooling currently exists, I’m curious if there’s something inherent in .class files that would prevent static linking.


> I’m not asking if the tooling currently exists, I’m curious if there’s something inherent in .class files that would prevent static linking.

It's not so much a problem with the .class files as it is a problem with reflection.

I can write `var foo = Class.forName("foo.bar.Baz")`, which will cause the current class loader to look up and initialize the `foo.bar.Baz` class if it's available. I can then reflectively create an instance of that class by calling `foo.getDeclaredConstructor().newInstance()`.

Java has a ton of really neat meta-programming capabilities (and those will increase with the new ClassFile API). Unfortunately, they make static compilation and dead code elimination particularly hard. Tools that allow for static compilation (like GraalVM) basically push the dev to declare up front which classes will be accessed via reflection.
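
To make this concrete, a minimal sketch where the class name is only decided at runtime, which is exactly what defeats static reachability analysis:

    public class ReflectiveLoad {
        public static void main(String[] args) throws Exception {
            // The class name is only known at runtime (it could come from
            // args, a config file, or the network), so a static analyzer
            // cannot prove which classes are reachable.
            String name = args.length > 0 ? args[0] : "java.util.ArrayList";
            Class<?> clazz = Class.forName(name);
            Object instance = clazz.getDeclaredConstructor().newInstance();
            System.out.println("Instantiated: " + instance.getClass().getName());
        }
    }

This is why GraalVM needs reachability metadata (e.g. reflect-config.json) listing such classes up front.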


Well, assuming that by "statically linking" you mean in the C sense, that's exactly what GraalVM Native Image does today: it statically analyzes the JAR for reachability, compiling only the methods/classes in use. This works, but it's also what makes native-image difficult to use and brittle.

It's hard, and some might argue impossible, to statically analyze reachability in a dynamic language like Java that allows for runtime class loading and redefinition. As it turns out, Java is much closer to JavaScript than C++ in terms of dynamic runtime behavior.


In the Java sense, if you properly utilize JPMS, jlink can cut dead modules and reduce your image size drastically. This, of course, depends on how "open" your runtime model is, as you said. If you're not dynamically loading JARs, it works really well.
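
For illustration, a typical jlink invocation (module and path names hypothetical):

    jlink --module-path $JAVA_HOME/jmods:mods \
          --add-modules com.example.app \
          --strip-debug --no-header-files --no-man-pages \
          --output app-image

Only the modules reachable from com.example.app's module graph end up in app-image.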


Native Image does exactly that already, but producing an AOT-compiled-and-linked native executable is not the goal, it's just a means to some goal. The real question is what is it that you want to optimise? Is it startup/warmup time? Is it the size of the binary? Peak performance? Developer productivity? Program functionality? Ops capabilities?

AOT compilation certainly doesn't give you the best outcome for all of these concerns (regardless of the language).


The Leyden team is working on exactly what you're looking for. There will be further JEPs.


I built Scanii [1], an unsafe/malware content detection API/SaaS, as a way to keep my coding skills sharp as I moved into engineering leadership roles. Over the years it has grown into a lovely $35k/month business while spending $0 in marketing thanks to our amazing customers.

My advice to aspiring entrepreneurs: get it out there quick, listen to your customers, and be ready to act on their feedback. Finding product/market fit is a journey even if you are selling into the most well-understood vertical, since it's not just about what the market expects; it's about what your engineering talent/capacity can deliver in a reasonable amount of time.

[1] https://www.scanii.com


Impressive! What do you think it is that you do that allows you to compete with VirusTotal, and even free tools like Jotti?

Presumably you're now using commercial AV tools, rather than Clam? Did you have to get some kind of special license from them to use it like this?


> Impressive! What do you think it is that you do that allows you to compete with VirusTotal, and even free tools like Jotti?

Thanks and good question. We don't really compete with VirusTotal since it's more of a research tool and, for a while, their terms explicitly prohibited commercial use (but I think that has changed). Jotti is a similar thing: more of a research tool than a high-performance API you can use to build commercial products on.

> Presumably you're now using commercial AV tools, rather than Clam? Did you have to get some kind of special license from them to use it like this?

Yeah, the product has expanded a bunch over the years and we use multiple detection engines [2] to catch all kinds of unsafe content. But you are right, we do license a commercial AV engine to act as a backup to our own to ensure the best possible detection rates. The licensing process warrants a blog post of its own since it's not what I would call easy.

[2] https://docs.scanii.com/article/149-how-do-the-different-det...


Congratulations on your success!

> get it out there quick

What did "getting it out there" consist of for you? How did you get it out there in the beginning?


> Congratulations on your success!

That is very kind of you, thank you.

> What did "getting it out there" consist of for you? How did you get it out there in the beginning?

For Scanii in particular, the original product was a thin wrapper around an open source AV engine, a UX hacked together over a weekend, and a credit card processing integration to collect payment - the bare minimum needed to find out if _anyone_ was willing to pay for this service.

With that said, what worked for me in this case is not what I would focus on here, since it depends on what kind of business you are trying to build. What I do believe is important is focusing on the economics of your space which, for IT, is all about productivity or, more succinctly, saving people's time - they pay you X for something that would cost them Y, in terms of people's time, to do themselves.

So, what you want to ask yourself is whether the cost of signing up, paying, and onboarding onto your product (the X in the equation above) is significantly lower than the next best alternative: either doing the same on a competitor's product or building something themselves (the Y above).

For Scanii, even at launch it saved people lots of time managing and operating malware detection engines, which are cumbersome and hard to keep up to date. I had a feeling that would be the case when I launched, but I couldn't be sure until our first customer voted with their credit card.


I guess more of what I'm asking is, how did you launch? The biggest problem I have is getting the word out about my products, which is what always ends up killing them.

I'm a developer by trade, so marketing and the like isn't my forte. Did you use ProductHunt? Indiehackers? Reddit? Twitter? Word of mouth?

Thanks for the reply btw, very helpful info. I've been iterating on something in my free time that fills a very small budgeting niche. I'm going to wrap it up with a website this weekend and see if I can gain any traction. At the very least, I know that if I had found the product I've built for $1.99 a month or something, I would have just paid the money. Hoping that others feel the same.


Got it. In that case, it helps to build a product for a community you can interact with. In my case, this was connecting with folks on Stack Overflow who were struggling with integrating malware detection into their apps... that was all the marketing I did to get the product validated - but keep in mind that was 10 or so years back.

Best of luck with your launch!


Thanks for all the advice!


https://www.scanii.com, a content arbitration/malware API service. It has been profitable for 10+ years now, with customers around the globe.

Building it was one of the best decisions I made in my life since it enabled me to make hard decisions at work that were not skewed by the fear of losing my job and not being able to provide for my family - I'm in engineering/product leadership.

But, do not be fooled, this also means I've had two jobs (albeit of unequal urgency) and that, obviously, equates to long work hours.


For the last 9+ years I've worked on https://scanii.com, a content identification service (think of it as the Unix file command on steroids, wrapped in an easy-to-use API). It started as a real MVP hacked together over a weekend (https://web.archive.org/web/20101209005314/http://scanii.com...) after I identified the need at a day job a long time ago. With zero marketing and sales it took a while to start gaining traction, but I always knew that we were solving a real problem with a good, fairly priced product. Nowadays it's big enough to be classified as a lifestyle business, and that's all right by me.


Not impressed, particularly with the basic-auth description. Basic auth is purely a well-understood vehicle for sending a tuple (aka the credentials) to authenticate an HTTP request; most of the concerns highlighted are about how the credentials are acquired and potentially reused across requests, and that has nothing to do with the HTTP protocol. For example, my API product scanii.com has used basic auth for 7+ years and I firmly believe it strikes the right balance between security and ease of use. Besides fairly complex key/secret tuples for server-side usage, we also provide one-time auth tokens for when you want to make API calls directly from a web browser (or another insecure device).
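
For what it's worth, server-side usage really is this simple; a minimal sketch with Java's built-in HttpClient (credentials and endpoint are hypothetical):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class BasicAuthExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical API key/secret tuple.
            String key = "my-api-key", secret = "my-api-secret";
            String header = "Basic " + Base64.getEncoder()
                    .encodeToString((key + ":" + secret).getBytes());

            HttpRequest request = HttpRequest
                    .newBuilder(URI.create("https://api.example.com/ping"))
                    .header("Authorization", header)
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }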


Author here. The doc does acknowledge basic auth as a valid approach for server-side usage, but the language was a bit too pessimistic about it.

I just modified it to say the following:

    If you use HTTPS, then basic authentication is a good choice for server side only applications. It is the easiest to get started, and is well supported by clients and server frameworks.


Basic auth is only OK if you don't have :80 open. Clients will send the creds in the clear to :80. It doesn't matter if you reply with a 400; the creds are already compromised at that point.


Even if you don't have :80 open, that doesn't mean there isn't a MITM that would accept the connection instead of you.


As long as the https:// prefix is used, this is not true; a MITM cannot downgrade that.


Plus an HSTS header for any type-in traffic.
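
For reference, a typical HSTS response header looks like this (the max-age value is illustrative):

    Strict-Transport-Security: max-age=31536000; includeSubDomains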


As a rule, HTTP basic auth is inappropriate without SSL, even for intranet apps. An office could easily have an insecure Wi-Fi access point, and someone sitting outside running Wireshark.


Also important to note: even on secure Wi-Fi with WPA2, if the attacker knows the password to the network, they can just as easily sniff such plaintext content.


It prevents delegation, though, which is an important use case for some APIs. As in "I want this third-party service to read my calendar from another server, but I don't want it to hijack my account, and I want to be able to revoke its access later on."


That's an implementation detail of how the credentials are created, managed, interpreted by the server, and reported on to the user - none of which is specific to the credential transport or encoding, which is all basic auth is. The thing to be aware of is how different HTTP clients, specifically user-interactive browsers, use (apply and remember) the credentials.


If you reinterpret Basic auth as "send a token that's not the user's password in the Authorization header", you're just doing OAuth 2 but writing "Basic" instead of "Bearer".


And if you dig deeper in this direction, you will find yourself Greenspunned into Kerberos.


We migrated scanii.com from Amazon Simple Payments to Stripe subscriptions (after the whole FPS debacle) and haven't looked back, it's truly the best way to process payments right now. If I could buy Stripe stock I would.


Let me try to simplify the underlying problem here (I dabble in this space):

A little bit of background: writing pattern-matching signatures is hard; adding a bunch of "known malicious" hashes to your malware database is easy.

So, company A, with a staff of folks writing pattern-matching signatures, has its engine added to VirusTotal, and VirusTotal shares/sells hashes found by that engine with folks that pay for its API. Company B, without a staff of engineers writing pattern-matching signatures, signs up for the VirusTotal API and creates its malware database based purely on the hashes other, actual engines create.

One important thing to keep in mind: when this happens at the scale of VirusTotal (basically all real engines are participating), the end-result "hash database" is, essentially, bullet proof, since it's likely that any sample used to test its effectiveness will be run through VirusTotal first.

We (I run scanii.com, a malware/content detection API service) run into this all the time, with folks either abusing or just not understanding the reason VT exists.
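
To make the asymmetry concrete, a minimal sketch of hash-based detection (the known-bad set is a placeholder): computing a file's digest and testing membership is trivial compared to employing analysts to write pattern-matching signatures.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.util.HexFormat;
    import java.util.Set;

    public class HashLookup {
        // Placeholder known-bad SHA-256 hashes, e.g. bought from an API.
        static final Set<String> KNOWN_BAD = Set.of("placeholder-hash");

        public static void main(String[] args) throws Exception {
            byte[] bytes = Files.readAllBytes(Path.of(args[0]));
            String hex = HexFormat.of()
                    .formatHex(MessageDigest.getInstance("SHA-256").digest(bytes));
            System.out.println(KNOWN_BAD.contains(hex) ? "known bad" : "unknown");
        }
    }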


>bullet proof since it's likely that any sample used to test its effectiveness will be run by VirusTotal first.

Nope. There are lots of situations where exploit kits will automatically re-compile and re-pack malware on-demand in ways sufficiently complex that they eliminate any signatures and evade AV detection.

A lot of companies are using VT as a filter for known bad to prevent even having to deal with such samples, but many unknown bad samples still exist and make it past the VT engine, only to be picked up by behavioral detection.

Conversely, a small number of known bad samples that are caught by VT can slip by behavioral detection engines that are gated by VT, causing infection (when VT is removed) where it would otherwise be prevented. Of course, in these cases, it is the fault of the behavioral vendor for not having sufficient behavioral detection, but relying on VT does make that easier. For instance, many companies have a loop where they can take samples detected by VT, run them constantly through an automated analysis lab, and see whether or not their behavioral analysis detects each sample. In the cases where it fails, that sample has a direct line to analysts who can reverse engineer it, come up with new behavioral patterns, and add it to training sets for any machine learning based detection. In this sense, not having VT support makes everything less safe.

The next issue is that companies like this simply can't be run on VT's platform because they're too heavy, as the article mentions. I think a good middle ground here would be to turn this analysis loop into a feedback loop by adding one more step: in cases where behavioral detects and VT does not, submit the report to VT in a standardized format so it can be added to their corpus.


The tricky part there is that it wouldn't work if you just sat there in a tight loop dispatching HTTP requests; any one of them timing out would likely trigger the deadline and prevent all subsequent HTTP requests from happening.

So, alternatively, you could do something with DynamoDB event sources, where you have some sort of pub/sub table that your Lambda functions listen on (basically a list of all the HTTP requests that have to happen), thus keeping it to a minimal one Lambda dispatch per HTTP request. The catch is that you would need another system to manage that table (technically that system could be Lambda itself).

Two important things: 1) I haven't used the DynamoDB/Lambda integration myself, so be skeptical of my suggestion, and 2) what I can say from our usage of the S3/Lambda integration is that concurrency is not a problem, with thousands of Lambda dispatches/second being surprisingly quick to spin up.
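
For what it's worth, a minimal sketch of that pub/sub table idea (untested, per the caveat above; assumes the aws-lambda-java-events library, and the table schema is hypothetical):

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class HttpDispatchHandler implements RequestHandler<DynamodbEvent, Void> {
        private final HttpClient client = HttpClient.newHttpClient();

        @Override
        public Void handleRequest(DynamodbEvent event, Context context) {
            // Each new table row is one pending HTTP request to dispatch.
            for (DynamodbEvent.DynamodbStreamRecord record : event.getRecords()) {
                if (!"INSERT".equals(record.getEventName())) continue;
                String url = record.getDynamodb().getNewImage().get("url").getS();
                try {
                    client.send(HttpRequest.newBuilder(URI.create(url)).build(),
                            HttpResponse.BodyHandlers.discarding());
                } catch (Exception e) {
                    // A single slow/failed request only affects this record,
                    // not the whole batch of dispatches.
                    context.getLogger().log("request failed: " + e.getMessage());
                }
            }
            return null;
        }
    }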


Hi there, author here, happy to answer any questions.


Excellent write up, kudos for using IAM and roles for this. We are working on implementing the very same system, we might just re-use your code. Thanks for sharing!


Thank you!


Yeah, I totally agree. As much as I like lambdas in JDK 8, the syntax feels super bolted on. What's wrong with just {}? -> is just silly.


What's wrong with the () -> {} syntax? It's nice and simple, and scales down well, e.g. a -> a.b()
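
For anyone following along, the same function at several levels of brevity:

    import java.util.function.Function;

    public class LambdaForms {
        public static void main(String[] args) {
            // Full form: typed parameter list, braces, explicit return.
            Function<String, Integer> a = (String s) -> { return s.length(); };
            // Inferred parameter type, expression body.
            Function<String, Integer> b = (s) -> s.length();
            // Single parameter: parentheses optional.
            Function<String, Integer> c = s -> s.length();
            // Or just a method reference.
            Function<String, Integer> d = String::length;
            System.out.println(a.apply("hi") + b.apply("hi") + c.apply("hi") + d.apply("hi"));
        }
    }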

