Launch HN: Odigos (YC W23) – Instant distributed tracing for Kubernetes clusters
162 points by edenfed on Jan 19, 2023 | 52 comments
Hi HN! We’re Eden and Ari, co-founders of Odigos (https://github.com/keyval-dev/odigos). Odigos is an open-source project that lets you instantly generate distributed traces for your applications. It works alongside existing monitoring tools and does not require any code changes.

Our earlier experiences with monitoring tools were frustrating. Monitoring a distributed system with multiple microservices, we found ourselves spending way too much time trying to locate the specific microservice that was at the root of a problem. For example, we once spent hours debugging an application which we suspected was causing high latency, only to find out that the actual problem was rooted in a completely different application.

Then we learned about distributed tracing, which solves exactly this problem. Unlike metrics or logs that capture a data point in time in a single application, a distributed trace follows a request as it propagates through a distributed environment by tagging it with a unique ID. This allows developers to understand the context of each request and how their distributed applications work.
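
To make the "unique ID" idea concrete, here is a minimal sketch of how a trace ID can travel with a request, using the W3C Trace Context `traceparent` header layout (version, trace ID, span ID, flags). The helper names `new_traceparent`, `child_traceparent`, and `trace_id_of` are made up for illustration and are not part of Odigos or any SDK.

```python
import secrets

def new_traceparent() -> str:
    """Start a new trace: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex chars, shared by the whole request
    span_id = secrets.token_hex(8)    # 16 hex chars, unique per operation
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(incoming: str) -> str:
    """Continue a trace downstream: keep the trace ID, mint a new span ID."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

def trace_id_of(header: str) -> str:
    return header.split("-")[1]

# Service A receives a request and calls service B, which calls service C.
header_a = new_traceparent()
header_b = child_traceparent(header_a)
header_c = child_traceparent(header_b)
```

All three headers carry the same trace ID, which is what lets a backend stitch the three services' spans into one trace.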

The downside is that it is difficult to implement. Unlike metrics or logs, the value of distributed tracing is gained only after implementing it across multiple applications. If even one of your applications does not produce distributed tracing, the context propagation is broken and the value of the traces drops significantly.
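
A toy simulation can show why one uninstrumented application hurts so much. This is only a sketch: the `trace-id` header name and the `handle` / `run_chain` functions are hypothetical simplifications, not how any real SDK is structured.

```python
import secrets

def handle(headers: dict, instrumented: bool) -> dict:
    """Return the headers this service sends downstream."""
    if not instrumented:
        return {}  # uninstrumented service: incoming trace context is dropped
    # Reuse the incoming trace ID, or start a new trace if none arrived.
    trace_id = headers.get("trace-id") or secrets.token_hex(16)
    return {"trace-id": trace_id}

def run_chain(instrumented_flags: list) -> list:
    """Pass one request through the services; record the trace ID each forwards."""
    headers, seen = {}, []
    for flag in instrumented_flags:
        headers = handle(headers, flag)
        seen.append(headers.get("trace-id"))
    return seen

full = run_chain([True, True, True, True])
broken = run_chain([True, True, False, True])
```

In `full`, every service reports the same trace ID. In `broken`, the third service forwards nothing, so the fourth starts a brand new trace and the request shows up as two unrelated fragments.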

We manually implemented distributed tracing for multiple companies, but found it a challenge to coordinate all the development teams to instrument their applications in order to achieve a complete distributed trace. Once the implementation was finished, we saw great value and fixed production issues much faster. But partial implementation wasn’t worth much.

We set out to automate this process. We knew how to do most of it, but the trickiest part was how to automatically instrument programs written in compiled languages (like Go). If we could do that, we would be able to automate the entire process of generating distributed traces. While researching, we realized that eBPF—a technology that allows the Linux kernel to load external programs for execution within the kernel—could be used to develop automatic instrumentation for compiled languages. That was the final piece of the puzzle, and with it we were able to develop Odigos.

Odigos first scans and recognizes all your running applications, then recognizes the programming language of each one and auto-instruments it accordingly, using eBPF and OpenTelemetry. In addition, it deploys collectors that buffer, filter, and deliver data to your chosen monitoring tool, and auto scales them according to the amount of traffic. This automation allows developers to enjoy distributed traces within minutes as opposed to manual effort which can take months to implement.
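
As a rough picture of what "buffer, filter, and deliver" means for a collector, here is a small sketch. It is not the OpenTelemetry Collector or Odigos code; `BatchingCollector` and its parameters are invented for illustration, and a real collector adds timeouts, retries, and backpressure.

```python
class BatchingCollector:
    """Buffers spans, drops unwanted ones, and exports in batches."""

    def __init__(self, export, max_batch=3, keep=lambda span: True):
        self.export = export        # callable that receives a list of spans
        self.max_batch = max_batch  # flush once this many spans are buffered
        self.keep = keep            # filter predicate: return False to drop
        self.buffer = []

    def receive(self, span: dict) -> None:
        if not self.keep(span):
            return
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.export(self.buffer)
            self.buffer = []  # start a fresh list; the exported batch is untouched

exported = []
collector = BatchingCollector(
    export=exported.append,
    max_batch=2,
    keep=lambda span: span["name"] != "/healthz",  # drop health-check noise
)
for name in ["/checkout", "/healthz", "/cart", "/login"]:
    collector.receive({"name": name})
collector.flush()  # deliver whatever is left in the buffer
```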

Automatic instrumentation across programming languages is not a trivial task, especially when dealing with static binaries (like the ones produced by the Go compiler). We built multiple mechanisms to make sure we inject the relevant headers in a secure and stable way. We developed a system that tracks functions and structs across different versions of open-source libraries. In addition, we developed a system that performs userspace memory management in eBPF. As a result, Odigos is the only solution that is able to automatically generate distributed traces for compiled languages like Go and Rust. While other solutions require users to be experts in OpenTelemetry or eBPF, our solution does not require prior knowledge of observability technologies.
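
One way to picture "tracking functions and structs across versions" is a lookup table from library version to struct field offset, which a probe needs in order to read the right bytes from a compiled binary. The sketch below is an assumption about the general shape of such a table; the library name, field name, and offsets are invented and are not taken from Odigos.

```python
# Hypothetical data: byte offset of a struct field, per library version.
OFFSETS = {
    ("example.com/httplib", "Request.Method"): {
        "1.0.0": 0,
        "1.2.0": 0,
        "2.0.0": 16,  # field moved when the struct layout changed
    },
}

def lookup_offset(library: str, field: str, version: str):
    """Return the known offset for this exact version, or None.

    Returning None instead of guessing from a nearby version lets the
    caller skip instrumentation rather than read the wrong memory.
    """
    versions = OFFSETS.get((library, field), {})
    return versions.get(version)

known = lookup_offset("example.com/httplib", "Request.Method", "2.0.0")
unknown = lookup_offset("example.com/httplib", "Request.Method", "3.0.0")
```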

Our solution can be installed on any Kubernetes cluster by executing a single command. Once installed, we detect the programming language of every running application and apply the relevant instrumentation. For JIT languages (Java and .NET) or interpreted languages (JavaScript and Python) we deploy OpenTelemetry instrumentation. For compiled languages (Go, Rust, C) we deploy our eBPF-based instrumentation. All of this is abstracted from the user, who only has to: (1) select any or all of their target applications and (2) select a backend to send monitoring data to.
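
The language-to-instrumentation mapping described above can be sketched as a small dispatch function. The grouping follows the post; the `Instrumentation` enum and `choose_instrumentation` function are hypothetical names used only for this example.

```python
from enum import Enum

class Instrumentation(Enum):
    OPENTELEMETRY = "opentelemetry"  # existing OpenTelemetry instrumentation
    EBPF = "ebpf"                    # eBPF-based instrumentation

# Grouping as described in the post.
RUNTIME_LANGUAGES = {"java", "dotnet", "javascript", "python"}
COMPILED_LANGUAGES = {"go", "rust", "c"}

def choose_instrumentation(language: str):
    lang = language.lower()
    if lang in RUNTIME_LANGUAGES:
        return Instrumentation.OPENTELEMETRY
    if lang in COMPILED_LANGUAGES:
        return Instrumentation.EBPF
    return None  # unknown language: leave the workload untouched

choices = {lang: choose_instrumentation(lang) for lang in ["Java", "go", "cobol"]}
```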

In May 2022, we released our first open-source project: automatic instrumentation for Go applications, based on eBPF. We later donated this project to the OpenTelemetry community and it is currently being developed as part of the Go Automatic Instrumentation SIG.

We are big believers in open standards, therefore the instrumentation and collectors used by Odigos are all based on open-source projects developed by the OpenTelemetry community. This also enables us to be vendor-agnostic.

Currently we are focused on building our open-source project. There are no pricing or paid features yet, but in the future, we are planning to offer a managed version of Odigos that will include enterprise features.

If you're interested to learn more, check out our docs (https://docs.odigos.io), watch a demo video (https://www.youtube.com/watch?v=9d36AmVtuGU), and visit our website (https://odigos.io).

We’d love to hear your experiences with tracing and monitoring distributed applications and anything else you’d like to share!




Wow, if this really works like you describe, then this is magic!

> Automatic instrumentation across programming languages is not a trivial task, especially when dealing with static binaries (like the ones produced by the Go compiler). We built multiple mechanisms to make sure we inject the relevant headers in a secure and stable way. We developed a system that tracks functions and structs across different versions of open-source libraries.

Could be very useful for non-greenfield projects. I'd love to learn more about the details, is there any writeup somewhere?

Though I'd still recommend new projects do "proper" tracing with not only one-per-service spans, but also spans for important functions, including additional application-specific tags, as that is easily 10x the value.

But since life is a sequence of tradeoffs, I think this project could be really useful in a lot of places.


> Though I'd still recommend new projects do "proper" tracing with not only one-per-service spans, but also spans for important functions, including additional application-specific tags, as that is easily 10x the value.

FWIW Odigos makes this possible because it uses OpenTelemetry (and generates OTel-compatible instrumentation for the eBPF-sourced data). You can go into an app that's instrumented this way, add an OpenTelemetry SDK, and start writing manual instrumentation or include additional instrumentation libraries. Your traces will just get deeper/richer when you do that.


We are actually doing a technical deep dive at the next OpenTelemetry Go auto-instrumentation meeting on Tuesday. Will be happy to share the presentation afterwards.

In addition, we automatically create spans for popular open-source libraries in use, so you should also expect to see spans for database connections, cloud SDKs, Kafka clients, etc. Definitely agree that manual instrumentation is very important in addition to the automatic one.


With tech like eBPF, dynamic instrumentation is actually surprisingly easy.

Still, always glad to see some innovation in this space.


Congratulations on the launch, and thank you for choosing an awesome license!

For an unrelated reason, today I was reminded about Pixie (https://news.ycombinator.com/item?id=25375170 and https://news.ycombinator.com/item?id=31687978 and https://github.com/pixie-io/pixie#readme ), which says it is also an eBPF Kubernetes observability tool, also Apache licensed.

I suspect the difference may be your aspirations to move out of just kubernetes, but I wondered if that's the biggest difference between your project and theirs? Or maybe the C++ versus golang?


As far as I know, Pixie uses eBPF for generating metrics. Odigos is focused on generating distributed traces, which are a different signal that spans multiple applications.


Looks awesome! I hadn't had the chance to dive into eBPF yet, but I had hoped someone would be able to use it in a clever way like this!

I was digging through the docs and it looks like you have custom language detection. Did you consider trying to extract the language detection features from buildpack to do this? I imagine you'd get more reliable results and less to maintain if you used that as the basis.


Yes we are actually using a combination of env vars / process names / linked libraries and container metadata to detect the language
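
For readers wondering what combining those signals might look like, here is a heuristic sketch. The specific checks (for example `JAVA_HOME`, `NODE_VERSION`, `libpython`, `libcoreclr`) are assumptions chosen for illustration, and `detect_language` is a hypothetical function, not the detection logic Odigos actually ships.

```python
def detect_language(process_name: str, env: dict, linked_libs: list):
    """Best-effort guess from process name, env vars, and linked libraries.

    Returns None when no signal matches, so the caller can skip the workload.
    """
    name = process_name.lower()
    if name == "java" or "JAVA_HOME" in env:
        return "java"
    if name.startswith("python") or any("libpython" in lib for lib in linked_libs):
        return "python"
    if name in ("node", "nodejs") or "NODE_VERSION" in env:
        return "javascript"
    if "DOTNET_ROOT" in env or any("libcoreclr" in lib for lib in linked_libs):
        return "dotnet"
    return None

guesses = [
    detect_language("java", {}, []),
    detect_language("gunicorn", {}, ["/usr/lib/libpython3.11.so.1.0"]),
    detect_language("node", {"NODE_VERSION": "18.12.0"}, []),
    detect_language("server", {}, ["/lib/libc.so.6"]),
]
```

A single signal is easy to fool (a Python app launched through `gunicorn` has no "python" in its process name), which is presumably why several are combined.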


Very cool!

I'd imagine the challenge here is the long tail of tracing and metric needs. I'm thinking things like:

- For the JVM, do you support things like thread pools and execution contexts well? e.g. say part of serving up a response to an HTTP request means executing some async work against an execution context, does the context propagation work properly? And if so, would this work for other JVM languages, like Scala, or just Java? When I've manually instrumented apps for context propagation, it's been easy for languages like JS (Node) and PHP, but hard for languages like Scala, where there are so many different concurrency models ppl use

- Some units of work/tracing are pretty standardized, like say serving up a response to an HTTP request. But others less so, for example work triggered by job queues/events, where essentially a message on some sort of Kafka/Redis/Postgres/whatever queue triggers your app to do some work (instead of an HTTP request). I have trouble seeing how Odigos would instrument this well - e.g. even if you detect the work, how do you label related metrics well (can't just rely on HTTP method/path)? How do you measure success/failure of the job? Or if you don't try to tackle this sort of use case, would there be something like Odigos libs for manual instrumentation, where necessary?
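
The thread pool concern in the first bullet is not JVM-specific. As a small illustration in Python (an analogy, not JVM or Odigos code), trace context stored in a `contextvars.ContextVar` is normally not visible on a pool thread unless it is propagated explicitly. The variable and function names here are made up for the example.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

def do_async_work():
    # Runs on a pool thread; reports which trace it believes it belongs to.
    return current_trace_id.get()

def handle_request(trace_id: str):
    current_trace_id.set(trace_id)
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Plain submit: on standard CPython builds the worker thread does not
        # see the caller's context, so the trace ID is usually lost here.
        plain = pool.submit(do_async_work).result()
        # Explicit propagation: snapshot the caller's context and run inside it.
        ctx = contextvars.copy_context()
        propagated = pool.submit(ctx.run, do_async_work).result()
    return plain, propagated

plain, propagated = handle_request("abc123")
```

Automatic instrumentation has to do the equivalent of the `copy_context()` step for every executor and concurrency library an app uses, which is where the long tail comes from.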


We are actually able to handle the long tail of tracing by leveraging the amazing open-source community. For languages like Java, we use the automatic instrumentation created by the OpenTelemetry community, which is really great and supports a ton of libraries. You can see a list of supported libraries here: https://github.com/open-telemetry/opentelemetry-java-instrum... This also allows us to support async tracing; for example, context propagation over Kafka messages is something we support (depending on the programming language).
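
Context propagation over a message queue generally works by carrying the trace context in message headers instead of HTTP headers. The sketch below uses an in-process `queue.Queue` as a stand-in for Kafka; `produce` and `consume` are hypothetical helpers, and the `traceparent` layout follows W3C Trace Context.

```python
import queue
import secrets

def produce(q, payload, trace_id):
    """Attach the trace context to the message headers before enqueueing."""
    span_id = secrets.token_hex(8)
    headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}
    q.put({"headers": headers, "payload": payload})

def consume(q):
    """Read one message and continue the producer's trace if context is present."""
    message = q.get()
    header = message["headers"].get("traceparent")
    if header:
        _version, trace_id, parent_span_id, _flags = header.split("-")
    else:
        # No context on the message: the consumer can only start a new trace.
        trace_id, parent_span_id = secrets.token_hex(16), None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "payload": message["payload"],
    }

jobs = queue.Queue()
trace_id = secrets.token_hex(16)
produce(jobs, {"job": "resize-image"}, trace_id)
span = consume(jobs)
```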


Ah cool! So like, if I used some Open Telemetry libs for more manual instrumentation, would it "play nice" with the automated instrumentation? Like say:

- I instrument a Scala app with Odigos, and it handles say 90% of the metrics, trace spans, etc. that I want

- But I want to add some extra spans, extra metrics

- If I then explicitly add OpenTelemetry libs as dependencies, will they conflict with the automated OpenTelemetry instrumentation (e.g. no "JVM dependency hell" issues like "I manually add 2.x of this lib, but then Odigos monkey patches it to 1.x, breaking my manual instrumentation")? And is there a way for the manual instrumentation to "play nice" with the automated instrumentation, e.g. I choose destinations in the Odigos UI for where to send traces, metrics, etc., is there a way for me to sort of have my manual instrumentation automatically target the same destinations?

Obviously you guys are an early stage startup, if there's no clear answer on hand for some of these questions, I'd just have to try and see, that's totally fair too :) I do love this idea of crazy easy 1-click style instrumentation.


Yes, exactly. Odigos plays nice with manual instrumentation, meaning distributed traces will include both automatic and manual spans. Currently there is no way to point manual instrumentation to the destination selected in Odigos, but we are working on it and should release it soon. Most SDKs have the concept of a no-op exporter; that way, Odigos will be able to pick up the manually created traces and deliver them to the chosen backend.


Very cool, ty for the responses!


This is awesome! Request tracing is basically the fundamental building block to observability in a distributed system.

Doing it automatically is a huge win!

Congrats on the launch and I look forward to learning more!


Thank you for the feedback! We believe a lot of innovation can happen with distributed traces, and Odigos is just the beginning


I am very amused by your choice of name, as Odigos is to land what Kubernetes is to sea.


In Greek it means driver, the one who drives lol. Edit: just saw your username, I'm pretty sure you already knew xD


Haha yes, and Kubernetes is the guy who drives ships.


Congrats on the launch! OpenTelemetry/Distributed Tracing has been in dire need of quality of life improvements, so I'm glad to see more folks filling in the gaps.

I see you're injecting trace IDs into programs. How do you guarantee that this doesn't break the binary or flag any security/compliance requirements?


This is something we are thinking about a lot. We developed multiple mechanisms to make sure we inject the IDs in a safe way. You can see the code here: https://github.com/keyval-dev/opentelemetry-go-instrumentati...


> dire need of quality of life improvements

Agreed! I'm one of the maintainers of part of the project - what sorts of things are top of mind for you w.r.t. quality of life improvements?


This is really cool. Upon further Googling, readers may be interested in https://kubernetes.io/blog/2017/12/using-ebpf-in-kubernetes/

If you can go beyond Kubernetes, I think that'd give Odigos more staying power. Naturally some integrations are out of your hands, AWS Fargate being one (https://github.com/aws/containers-roadmap/issues/1027). However, if you could get integrations up and running with the likes of Fargate, Fly.io, Render.com, etc., that'd be amazing.


Support for non-Kubernetes environments is something we are planning to release very soon.


This is really cool - given my perception of the target market it might be worth targeting AWS Elastic Container Service (ECS) next as the userbase there, I would imagine, is generally looking for less-complex solutions (given the complexity difference between Kubernetes and ECS).


ECS is definitely on our roadmap!


I was thinking of giving it a try, but why does it have Datadog as a prerequisite?

> A Datadog account with API key. Go to Datadog website to create a new free account. In addition, create a new API key by navigating to Organization settings, then click on API keys, and create a new key.

https://docs.odigos.io/prerequisites


I think that prerequisite is only for that tutorial.

If I understood correctly, Odigos supports a bunch of observability backends and, instead of Datadog, you could use Jaeger, Splunk or Open Telemetry (for example).

https://github.com/keyval-dev/odigos/blob/main/DESTINATIONS....


This is not a requirement, sorry for the misleading documentation. We just rewrote everything and this bullet is probably a leftover from a previous version of the docs. Fixing it now.


Interesting, it looks like you've put some hard work into this project. My question is, what if a pod has multiple containers in it? How does Odigos choose which icon/programming language is displayed for the pod? For example, I have a Deployment that runs pods with two containers: a php-fpm container and an nginx container. Would the "Choose Target Applications" page show an icon for both Nginx and PHP for the given Deployment? Would Odigos report separate metrics to the backend Destination for both PHP and Nginx?


Odigos will be able to instrument both containers each with the relevant instrumentation. As you pointed out, there is currently a bug in the UI that shows just one programming language per pod. Working on fixing it soon


The BPF instrumentation is quite cool! I wonder if uprobes have a performance impact. Does it roughly compare to a single syscall?

https://github.com/keyval-dev/opentelemetry-go-instrumentati...


Really nice! I was looking into implementing tracing for a few projects I'm being onboarded on, and it seems to solve the "ask the devs nicely to implement OpenTelemetry or at least merge my commits" part.

My "gut instinct" would be to export that to Jaeger, but I'm open to suggestions as to better alternatives. We're on GCP so it might be an opportunity to try Google Cloud Trace as well.


Amazing project. Is everything open source? Are you planning anything for the big enterprise who wants to pay for the service?


Thank you! Yes, we are working on adding enterprise features.


Wow, congrats guys, that is a game changer! That has the possibility of becoming a standard in some companies I worked with.


Are there any plans to branch out past Kubernetes? I'd be very interested in Odigos but I have separated myself from Kubernetes and am all in on Nomad. I've been looking at how I want to handle tracing and telemetry and this seems it'd be a great fit except for that minor detail


Definitely. Nomad is probably the first environment we are going to support after Kubernetes


Just saw the demo video, looks awesome. Is this tool from the future or some dark wizard tricks? Keep up the great work.


Distributed tracing really ought to be built into every web application framework. What's the value in signing over your autonomy to a framework if it isn't going to handle cross-cutting concerns like forwarding correlation IDs from the inbound request to all outbound requests triggered by that request?


Unfortunately, not all web frameworks do this automatically. In addition, sometimes you may want to propagate the ID over non-HTTP connections like database drivers or even message queues.


I'm curious how this will stack up against Sysdig/Falco - https://sysdig.com/blog/sysdig-and-falco-now-powered-by-ebpf....

eBPF for the win, this is a nice approach with Odigos.


I don't think it is comparable to Falco. Falco is more about security violations of the container. It is not related to distributed tracing.


Falco is a really cool project, but it focuses more on security. Odigos is focused on getting better monitoring signals from your applications, especially distributed tracing.


Looks cool! Great to see entrants into this space.

How does this compare with Cilium? Looks like they do OT tracing (https://github.com/cilium/hubble-otel) but it's not native/core, is that the main distinction?


As far as I know, Cilium does not do automatic context propagation and requires code changes to achieve it. Odigos does context propagation automatically.


Are there available positions for hire in the company? This sounds really interesting!


Not yet, but hopefully soon. Thank you for the feedback!


Congrats on the launch! Glad to see that you also support SigNoz as a backend.


Is it possible to use this for non-Kubernetes setups (for example, a single Docker container or a single server)?


Not yet, but I can set up a custom Docker Compose YAML for you, depending on the programming language you are using.


I'm using Rust (with the warp framework, if it helps). I can help test if there is a Docker Compose setup :)


Good luck, looks amazing!

Always fun to see Israeli names here, good luck!



