We Don’t Run Cron Jobs (2016) (nextdoor.com)
177 points by bedros on Aug 21, 2018 | 140 comments



Cron works great when you don't need to guarantee execution, e.g., if a server goes down. Unfortunately, all the alternatives are pretty heavyweight, e.g., Jenkins, Azkaban, Airflow. I've been working on a job scheduler that strives to work like a distributed cron. It gets by with very little code, because it leans heavily on Postgres (for distributed locking, parsing time-interval expressions, configuration storage, log storage) and PostgREST (for the HTTP API). The application binary (~100 lines of Haskell) polls for new jobs, then checks out and executes tasks. The code is here if you're interested:

https://github.com/finix-payments/jobs

It compiles to machine code, so deploying the binary is easy. That said, I’d like to add some tooling to simplify deploying and configuring Postgres and PostgREST.


That is hardly "all the alternatives". Some of the alternatives to the Vixie cron family (https://news.ycombinator.com/item?id=17005677) are:

* Uwe Ohse's uschedule, http://jdebp.uk./Softwares/uschedule/ https://ohse.de/uwe/uschedule.html

* Bruce Guenter's bcron, http://untroubled.org./bcron/

* GNU mcron, https://news.ycombinator.com/item?id=17002098

* Thibault Godouet's fcron, http://fcron.free.fr/

* Matt Dillon's dcron, http://www.jimpryor.net/linux/dcron.html

Other toolsets include Paul Jarc's runwhen (http://code.dogmap.org/runwhen/) which is designed for first-class services individually scheduling themselves.


To be fair, all of these are node-local, not distributed, which is what the parent was talking about. (Some of the ones you mention are also very old and unmaintained.)

For a distributed cron, you need a more sophisticated, transactional system that can deal with node failures.

There are many cron replacements, but they generally don't tackle the main problems with the original cron, such as the inability to retry, queue, prevent overlapping jobs, enforce timeouts, report errors and metrics in a machine-readable way (spewing out emails is not a good solution), etc.


I think what you're describing may be a batch scheduling system, such as PBS or LSF.

Those seem to be rarer, since the legitimate use cases for a distributed system (what used to be called "grid computing") were rare.

Nowadays, I assume everyone just uses Yarn on Hadoop, when they think scaling "up" ends at a mid-range 2U server. I don't know how good its actual time-of-day/calendar based scheduling is, though.


I disagree. If you really look at the problem space, it turns out that "classic cron" is just a "batch scheduling system" that is poorly implemented.

For example: Want to run a backup every night? With cron, you run into several issues: A backup job could fail; how do you recover? A backup job could run (due to sudden latencies) for an unexpectedly long time, overlapping with the next scheduled time; how do you prevent "dogpiling"? How do you record the complete log about when each job ran, what it output, and whether it was successful or not? And so on. Or for that matter: What if the box that is supposed to schedule this job goes down?

These are fundamental systems operations tasks that you want Unix to solve. Unfortunately, cron leaves all the actual hard challenges unsolved; it's fundamentally just a forker. Cron, then, isn't really useful for much except extremely mundane node-local tasks such as cleaning up $TMP. I can think of relatively few tasks in a modern environment that can use cron without running into its deficiencies.

This means that, for example, backups tend to be handled by an overlapping system that actually has these things built in. This is a shame, because the Unix philosophy wisely encourages us to separate out concerns into complementary tools that fit together. Instead of a rich, modular, failure-tolerant system that only knows how to execute jobs, not what the jobs are, you get various monolithic tools that build all of the logic into themselves.

Not "everyone just uses Yarn" at all. For the projects I work on, we use distributed cronjobs on Kubernetes, which solve pretty much all of the problems with cron. For many people, both Hadoop and Kubernetes are overkill, though, yet they would benefit from a resilient batch scheduler.


> I disagree. If you really look at the problem space, it turns out that "classic cron" is just a "batch scheduling system" that is poorly implemented.

I'm a bit confused. It still sounds like you're describing a batch scheduling system and that cron isn't one (because it only implements a narrow function of such a system).

What features does PBS (not cron) lack that any of the re-inventions do have?

> Unfortunately, cron leaves all the actual hard challenges unsolved

> the Unix philosophy wisely encourages us to separate out concerns into complementary tools that fit together.

I'm also having trouble reconciling these two positions. Cron does one thing, which is forking on a schedule. To leave something like "dogpiling" for another utility (e.g. dotlockfile) to solve seems consistent with the Unix philosophy.
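For what it's worth, that overlap guard is tiny. A sketch of the flock pattern in Python (the lock-file path is an arbitrary example):

```python
import fcntl
import os

def run_exclusive(lock_path, job):
    """Run `job` only if no other instance holds the lock; skip otherwise."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        try:
            # Non-blocking: if a previous run is still going, don't dogpile.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        job()
        return True
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```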

> Not "everyone just uses Yarn" at all. For the projects I work on, we use distributed cronjobs on Kubernetes

That doesn't quite refute my point, as you can feel free to consider "Yarn" as merely a metaphor for the currently most-popular inbuilt scheduler of a currently popular distributed computing platform.


Thanks for sharing these. In my case, I needed a system capable of sending payments at precise intervals, ensuring that a single payment was sent exactly once, despite the job being shared by a group of servers (for redundancy and high availability). Postgres provides the central locking, so tasks can be handed out to the pool of workers, while also supporting failover by streaming to a warm-standby replica.
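A minimal sketch of that claim step in Python. The real thing would use Postgres (typically SELECT ... FOR UPDATE SKIP LOCKED, or advisory locks); sqlite stands in here only so the sketch runs standalone, and the table and column names are invented:

```python
import sqlite3

def claim_due_job(conn, worker_id, now):
    """Atomically claim at most one due job; return its id, or None."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE status = 'pending' AND run_at <= ? "
        "ORDER BY run_at LIMIT 1",
        (now,),
    ).fetchone()
    if row is None:
        return None
    # Compare-and-set: only one worker's UPDATE can flip pending -> claimed,
    # so a pool of redundant workers never sends the same payment twice.
    cur = conn.execute(
        "UPDATE jobs SET status = 'claimed', claimed_by = ? "
        "WHERE id = ? AND status = 'pending'",
        (worker_id, row[0]),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None
```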


Jenkins is "heavyweight" but Postgres isn't?


When I was running both at the same time about five years ago, yeah, I personally found Jenkins to be more operationally expensive than a small postgresql instance. It required more memory, more CPU, and more time to set up and keep running as desired. Maybe things have improved or maybe it's just a YMMV thing, but yeah, I would have said the same thing as your parent without thinking twice.

Edit to add: oh yeah, I just read another comment that reminded me that it also ate up lots of disk space as well.


Well if Postgres is already in your stack, it's no extra maintenance.


In what way is it not? I'm really curious actually


As someone who has administered both, I’d rather manage 10 Postgres instances than one Jenkins box. No question.

Edit: I should expound. Jenkins seems like it has a lot of clunky moving parts. It all works, and I’d rather use it than anything else, but it’s kind of like IKEA furniture: you use it because you have to, not necessarily because you want to.

It’s also incredibly difficult to automate. I can configure Postgres with a config file or two and easily use Ansible to get the exact same instance every time. Jenkins has to be dragged into Automation Alley kicking and screaming. I partially blame this on the fact that Jenkins has nontrivial amounts of configuration that’s done via GUI. I approach a long-running Jenkins instance with the same fear and dread I approach a Windows box that hasn’t been restarted in six months. I.e. the box is now a snowflake and trying to make it reproducible and automated is going to be a bad time.

I could go on, but as a devops critter, Postgres wins every time.


Jenkins is probably 100 times easier to deploy, configure, operate, and scale than Postgres+whatever, which is a horrifying statement, but still true. If what you have to do is schedule and run arbitrary jobs on any kind of machine, it's no contest. They aren't even related. Postgres is a relational database, and Jenkins is a single Java process that stores flat files on local disk, connects to remote nodes with SSH, and has a thousand plugins.

It's like comparing a missile with an airplane. One gets where it needs to go faster and more efficiently, and the other one transports people.


Here’s a fairly objective metric for comparing the complexity of deploying Jenkins vs Postgres:

https://github.com/geerlingguy/ansible-role-jenkins

https://github.com/geerlingguy/ansible-role-postgresql

Two ansible roles by the same author, supporting both Centos and Ubuntu. Not hugely different in complexity IMO. Installing Postgres on FreeBSD, though, is little more than

    pkg install postgresql10-server
    sysrc postgresql_enable="YES"
    service postgresql initdb
    service postgresql start


I get that they install differently, but that's not the point. The point is that they do very different things, in different ways. Postgres is not a Jenkins replacement. Postgres is storage and querying. Its equivalent on Jenkins is XML files.


Jenkins is one of those things that looks good until you trip over the numerous bugs in it. Such as the terribly leaky implementation that hoses itself once a week, or the millions of inodes it decides to gobble up, which makes backups a chore, etc.

Sure it works but from an admin perspective it’s horrid.


Is this really more horrid than having to hire a DBA to maintain, operate and upgrade the cron replacement? I can write a script to identify and remediate bugs in apps. I can't so easily write a script to diagnose badly used database apps and get the developers to rewrite their queries to not blow up the database servers. It's much easier to set up, lock down, and maintain Jenkins than Postgres+whatever, imo. But I guess everything can go to hell depending on how it's used.


>Is this really more horrid than having to hire a DBA to maintain, operate and upgrade the cron replacement?

Only you don't. I'm not sure why you seem afraid of Postgres, but millions of people use it, across tons of companies, and even for personal projects, and it's trivial to setup and run. Oh, and those lots of plugins and stuff you mention? You don't have to use them if you don't need them. They don't even enter the picture at all.


> I'm not sure why you seem afraid of Postgres

Because I've managed database applications before?

> and it's trivial to setup and run

Jenkins is more trivial. It's one process. It doesn't depend on an external high-availability networked data service. Backup is 'cp -r SRC DEST', or a plugin if you're fancy. And, again, Postgres does not replace Jenkins, it's just the storage and querying. It adds a lot of complexity and service availability points of failure that Jenkins does not have.


Last I checked, the plugin to do backups didn't work.

Copying the Jenkins directory can be done, but restoring it on another fresh install will not work. There are many files that need to be deleted manually before the instance can start without error.


We have over 100 GB across tens of millions of inodes in Jenkins. This does not scale well even on XFS backed by SSDs. It takes hours to run backups.

pg_dump on the same data volume takes about 14 minutes to dump, compress and move to another node.

The filesystem is a shitty database. Thought we’d all learned that by now.


> The filesystem is a shitty database. Thought we’d all learned that by now.

How dare you :) "The" filesystem is a great database... for certain applications. Big binary blobs of video store, and even index, particularly well!

I think what the collective "we" haven't learned is to avoid trying to think about scale intuitively (rather than "doing the math") and to avoid extrapolating from the trivial scenario.

It's why "latency numbers every programmer should know" is still a thing.


It’s ok for the narrow set of use cases that it is good at. Outside of that it needs something that has some intelligence and ability to read optimise it.

Heavily locked stuff, lots of small things, huge number of locatable data entries, not so much. Which is Jenkins.

As for scale: incidentally, I worked on a very old filesystem-based store back when we had spindles to contend with. We had four racks of Sun disk array trays. The only way it performed well was keeping it to 2Gb spindles. I’m well aware of scale issues on file systems, perhaps more so than the Jenkins developers. We had to scale that to 4000 concurrent users.


> It’s ok for the narrow set of use cases that it is good at

My comment was partly in jest, and mostly hoping to spur conversation, not as a true disagreement or criticism with your comment.

Still, saying "It's OK" for those narrow use cases may be too dismissive, even if "great" is an exaggeration. There are plenty of examples where DBMSes (especially relational ones) have fared poorly in comparison.

> Outside of that it needs something that has some intelligence

I fear I'm missing your point here. Certainly "the filesystem", as in the Unix syscall interface to a hierarchical arrangement of files, lacks intelligence, but that doesn't mean the specific underlying implementation must.

We've also come a long way from everything being the Berkeley Fast File System, with many choices of underlying filesystems (including CoW ones like ZFS and Btrfs).

> and ability to read optimise it.

I assume by read optimization you don't just mean something like the buffer cache, but a user-specified index?

> Heavily locked stuff, lots of small things, huge number of locatable data entries, not so much. Which is Jenkins.

Does Jenkins use external locking? It would be odd in light of some of the comments elsewhere in the thread touting its advantage of being a single process. Of course, even if it's using only locking internal to itself, there's a good argument that its authors needlessly re-invented a DBMS (which we've seen happen when other niche-use databases get used for broader purposes).

I'm not sure a large number of tiny files is inherently problematic for a filesystem, merely problematic for existing implementations, and some are better at it than others. What about something like libferris (assuming perfectly spherical cows and ignoring the performance implications of FUSE for a moment), which can back a filesystem with an arbitrary database?

IOW, is Jenkins-using-the-filesystem an Ops optimization/tuning problem, or is it a more fundamental problem that can only be addressed with modifying its code?

> The only way it performed well was keeping it to 2Gb spindles.

I worked with the aforementioned hardware extensively, early in my career, but at a company that wrote a data warehousing (aka OLAP) RDBMS.

I'm reasonably confident in saying that your performance observations have almost nothing to do with the filesystem itself and everything to do with I/O performance in general. Large numbers of smaller spindles were absolutely required for good database performance and scalability.

> I’m well aware of scale issues on file systems

I didn't mean to suggest you didn't, rather the opposite, as "the collective we" was a euphemism meant to imply "everyone but us".

Anyway, my overall point is that you and I may be acutely aware of the real, practical problems with scaling filesystems, but we're rare. Since there's nothing fundamental/theoretical that makes the filesystem an obviously poor choice at modest scale (the definition of which increases as computing power increases), the lesson does not get learned by everyone.

Instead, because truly large scale becomes rarer and rarer as computers become more powerful (CPU more than I/O, of course, but then.. SSDs), every time the lesson is re-learned by an individual/company, they think it's a new, or at least unique problem, and we end up with a re-invention of Portable Batch System (a fairer characterization than re-invention of cron, IMO).


I’m a postgres DBA as well. It mostly requires zero maintenance.


For me, operating means running with no single point of failure, with monitoring, and upgrading with very little or no outage. Running a postgres HA is hell, and we are moving away from it to a cloud-managed (Postgres) database because it's one of the most complicated and error-prone parts of our infrastructure. Jenkins sucks because it needs GUI interaction on install (or at least it did years ago).


> Running a postgres HA is hell

How does Jenkins HA compare?

I've only ever used it in the internal/build scenario, never in production.


Use an ECS scheduled task or the equivalent in Kubernetes: problem solved.

Docker containers run using a task scheduler are perfect for this.


The last company I worked at used Jenkins for all cron jobs. It has great reporting, supports complex jobs, has a million plugins. It worked really great.


My thoughts exactly, this is a perfect use case for Jenkins.

Talk about reinventing the wheel!


I always find Jenkins to be an absolute nightmare. The IT/"devops" people hand-install it (most likely on your incredibly unreliable on-prem vSphere or whatever) and then immediately forget about it. Of course, mere developers aren't capable of such a feat of system administration as installing Jenkins. Then three months later, when it starts failing due to disk space or memory issues, you have to beg them to fix it so you can keep working.


I would have to agree with this. I don't know what state of the art Jenkins is in now, but our company does infrastructure as code, and I don't think Jenkins would pass muster. I think we would see the UI as a major liability as time goes on. If we ever needed to migrate off the instance of Jenkins, or a plugin is no longer supported, or a million other things, Jenkins is a rather large deployment for running scheduled tasks. There would be too many specialized things installed or hand-manipulated for us. We already had major issues with another company basically hiding a lot of packed crap all over the place, so infrastructure as code has helped a lot.

We ended up writing a 114-line TypeScript job scheduler that uses MongoDB as a store. MongoDB provides the atomicity of scheduling (findAndModify). Beyond that, our UI is Robo 3T. The job collection has 4 simple states.

Any process can write a job document with the earliest time the job can run. It also scales with the rest of our application: more instances, more jobs that can run simultaneously.
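For illustration, the claim step that find-and-modify buys you looks roughly like this (Python here rather than TypeScript; `coll` is assumed to be a MongoDB collection handle, and the field names are invented):

```python
def claim_filter(now):
    """Match one due job that nobody has claimed yet."""
    return {"state": "scheduled", "runAt": {"$lte": now}}

def claim_update(worker_id):
    """Flip it to 'running' in the same atomic server-side operation."""
    return {"$set": {"state": "running", "claimedBy": worker_id}}

def claim_one(coll, worker_id, now):
    # findAndModify (find_one_and_update in pymongo) is atomic on the server,
    # so two instances polling the same collection never both claim a document.
    return coll.find_one_and_update(claim_filter(now), claim_update(worker_id))
```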

But... I can see in shops that don't have this complexity yet, Jenkins might be just fine.

Edit to add: We decided against using another data store like a queue, which is something we would have used a long time ago, but we're already at an infrastructure complexity point where it would not be worth it: we already have Elasticsearch, MongoDB and MySQL, so adding an AWS queue or RabbitMQ would be out of the question at this point unless it provided massive functionality we would need for something else.


> But... I can see in shops that don't have this complexity yet, Jenkins might be just fine.

It sounds like you have a very simple collection of jobs running that lack run-time complexity like remote hosts being unavailable or preventing a service from being overloaded with requests after it comes online because you have an ever-growing list of tasks waiting.

The actual complexities of DAG-based workflow management tools are considerably more than can be expressed in 114 lines of any language, and I hope any programmer I work with would spare me and my peers the misery of trying to roll and maintain our own when plenty of open-source and actively maintained projects fit the bill and could benefit from our contributions.


Yep that is true, our jobs have no forward or reverse dependencies, no knowledge of what came first or what is next. Just a scheduler. If I picked up an open source tool, our company would be in process hell including managing how to integrate it with our stack.

I don't know why you would mention this, because you don't have the requirements I have. If I had solved all scheduling needs for all people in 114 lines of code, I would have said so.


Sounds like your problem is with your admin department and not jenkins...


As the admin department of a company that runs Jenkins, it’s the second worst thing we have to keep alive (black duck software was the worst).

Jenkins is leaky, unstable, unfriendly to the filesystem, buggy and poorly documented and tested.

If you’re going to do anything like this it’s actually worth paying for team city or something else.


Agreed.

In a gig last year, I needed to improve the build turnaround times on a Jenkins system. After learning everything I could about Jenkins, I realized the correct answer was to delete it and rewrite my own version that ran on the local system, which was way, way faster and much easier to debug and maintain.

Not having to commit/upload your code to a build server and then wait to get an executable/package back is an enormous time-saver just in that overhead alone, but even the build itself was faster, even though it was written entirely in bash.

Go figure.


Never mind: you can very probably find a security vulnerability that you can exploit to take over the machine so you can maintain it all yourself.


it was all "devops" and "self service" until you couldn't figure out how to click "discard old builds", eh?


"Hey, I need this free plugin or I can't even build my app."

"Maybe next year we'll have time to look into your plugin."


After the first couple hundred "free" plugins you will find that you pay in other ways - this is the same in anything that supports a plugin.

In any system I manage the cost of the plugin simply isn't an issue - if it's worth the money, it's easy to justify paying. But if it's free, and it becomes critical to the workflow, and in a year's time the author has abandoned it, then the "cost" is that we assume the responsibility for maintaining it forever, or we retool the workflow, or in some other way, it's very expensive. So paradoxically paid-for plugins are a much easier sell and requests for free stuff are shot down straight away.


I really dislike this anti-helpful attitude. If there's a better/for-pay option that you prefer, suggest it.


>"I always find Jenkins to be an absolute nightmare."

Nothing you go on to detail after that opening sentence has anything to do with Jenkins itself, but rather with your company and its organization.


If engineers didn't constantly reinvent the wheel they wouldn't have anything to blog about.


This is my favorite comment on hackernews so far


Interesting, isn't Jenkins mostly used for internal builds? I have a hard time imagining using it for things like marketing email cron jobs.


Jenkins is just a job running tool. It runs the job, logs output and tracks success or failure by exit code. Also has very flexible and granular permissions.

Builds are just another job, usually triggered by a web hook after a git commit or periodic polling.

At my last job, we took cron jobs that were spread across 27 servers and weren't really well monitored, and moved them to Jenkins. The server SSH'd into the boxes (where necessary), ran the jobs, alerted us in Slack if there was a problem, and we could use the tracked logs to find out exactly what happened. Really helpful.

Plus the cron scheduler has an alternative to picking a specific time to run by deferring to Jenkins based on other things that are running and the time it historically takes to run the job.

So instead of a bunch of jobs running at 3am, you can set a job to run at H(1-5) and it will run sometime between 1am and 5am, as best determined by Jenkins.
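The H token spreads jobs out by hashing the job name into a stable slot, so nobody has to coordinate start times by hand. A rough illustration of the idea (Jenkins's actual hash function differs; this is just the shape of it):

```python
import hashlib

def h_slot(job_name, lo, hi):
    """Pick a stable value in [lo, hi] from the job name, like the
    H(lo-hi) token in a Jenkins cron spec. Illustrative only."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return lo + int.from_bytes(digest[:4], "big") % (hi - lo + 1)
```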

You can also trigger other jobs following the success (or failure) of another. For example, we used to have a daily digest that ran at the same time everyday and backed up our workers for a couple of hours. Instead, we created worker servers that only listened to that specific queue and scheduled a Jenkins job to start & update however many we needed, then after it was successful to trigger the digest jobs. Later in the evening, it was scheduled to scale down.

Just having all the jobs in one place with job specific log rotation rules and success tracking was extremely helpful.

It wasn't pretty...but it worked really, really well.


A quick look at the front page (https://jenkins.io/) suggests this is still the focus: "The leading open source automation server, Jenkins provides hundreds of plugins to support building, deploying and automating any project.".

I think this is a case of everything looking like a nail to someone with a hammer. Just because you need a solution for scheduling and jenkins does scheduling doesn't make it the right tool for the job.

Hopefully everyone recommending it is at least running separate instances for production environments? Some (thankfully) former colleagues of mine had our build server running production jobs...


The beauty of jenkins is that you can use it for pretty much anything.


I tend to feel the opposite about things you can use for pretty much anything. I much prefer tools that are good for a single thing. I really disliked maintaining a Jenkins instance awhile back; maybe this "jack of all trades" ethos is part of the reason I felt that way.


I tend to prefer simplified stack, especially for tools that don't need to be touched very often.


I can't tell: are you saying that using Jenkins results in the simplified stack you prefer? If so, I don't agree; I think Jenkins is a very un-simple way to run cron jobs.


When I say simplified stack I mean fewer tools that must be learned and managed by people. Jenkins simplifies in that way.


Interesting. I still don't see it that way. It seems less like a single tool when used in this way than multiple tools running alongside one another. Different strokes for different folks I suppose!


Indeed, I was scratching my head reading this, wondering why they were reinventing Jenkins.


Java hate is strong out there and clearly blinding.


It might not be that irrational when you consider that Oracle has been making it clear that they intend to be paid for any commercial use of Java.


The blog said they are a Python shop. Jenkins being Java based probably won't sit well with them.


Why? You can run python or anything else in Jenkins just fine.


I've personally moved all my Python scripts I run as cronjobs (for my company, not personal scripts) into Jenkins, and that's never been a problem.

Everything is fully Dockerized, and Jenkins is set to just run the docker image with a specified entry point.


How do you test those complex jobs? How do you monitor jobs that get slower and slower? Can you make sure a job is not executed too soon? Can you have dynamic jobs (add/remove/enable/disable 1k per day)?

Different requirements, different solutions.


Doesn't Jenkins use a web interface to create, edit and manage jobs? Is it possible to put jobs in a VCS? Or manage jobs automatically (say with Ansible)?


Yes to the web interface. VCS kinda, the jobs are all XML files, but Jenkins has built in change history like Wikipedia allowing reverts and comparisons.


We've been moving all our cronjobs into Jenkins lately. It works wonderfully, and I really recommend it.


Same here, works great for us.


Same here. It’s great for it.


> Here is an example of a typical oncall experience: 1) get paged with the command line of the failed job; 2) ssh into the scheduler machine; 3) copy & paste the command line to rerun the failed job

No doubt this works if it's an established procedure, but if I were approaching a system I wasn't familiar with, I would never do (3), because the environment can differ wildly between crond and a login shell. It is safer to edit the cron schedule and duplicate the entry with a time set for a few minutes in the future. (And clean that up afterwards.)
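Concretely, the temporary duplicate might look like this (the path and times are arbitrary examples); delete the second line once it has fired:

```
# original entry
15 3 * * *  /usr/local/bin/nightly-report.sh
# temporary one-off rerun, a few minutes out, under cron's own environment
25 14 21 8 *  /usr/local/bin/nightly-report.sh
```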


Not only that, but just ssh'ing in and blindly rerunning the failed job isn't the answer anyway. Research why it failed. If the job has to write a file on a full disk, you can rerun that thing a hundred times and it will never work. I'm sure they must have missed something from their write up as I can't imagine that's their on call playbook.


You could even write a cron job to clean up the crontab


Or you could use at(1) or batch(1) to invoke the job once. That seems significantly easier and less error prone than trying to clean up crontab.


at(1) is great, but you'll still have the problem of a different environment to cron (it copies most of your environment variables and working directory).


Or use Puppet/Salt/Chef/ansible to manage the file automatically


I've seen this done previously. It gets arduous to maintain a large list of cron jobs, teach everyone how to add new ones, rebuild the cron box and redeploy on updates to config management.

Just use Jenkins, until you outgrow Jenkins. Then use a distributed scheduler.


Can you recommend one?

We recently had to implement a prototype in house; it was not as easy as I thought it would be, and we could not find a good solution that could, e.g., handle 100k jobs, support both long-term (never expiring) and short-term jobs (depending on an entity's life cycle), and be highly available.

Thanks.


Kubernetes if you're a containers first shop, Hashicorp's Nomad if you're not just containers (can run binaries, etc).


Kubernetes does Cron jobs.
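A sketch of such a manifest, with hypothetical names; a CronJob gives you the retries, overlap prevention, and captured logs that classic cron lacks:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report          # hypothetical job name
spec:
  schedule: "15 3 * * *"        # standard cron syntax
  concurrencyPolicy: Forbid     # no overlapping runs (the "dogpile" case)
  jobTemplate:
    spec:
      backoffLimit: 3           # retry failed runs up to 3 times
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: report
            image: example/report:latest   # hypothetical image
```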


If you’re already using cron like millions of us, here are a couple things that can help you:

https://crontab.guru

https://cronitor.io/docs/cron-troubleshooting-guide


I worked at a company which wrote their own scheduler, and it was fraught with bugs. Dealing with time and dates is HARD. Really, really hard. Your custom scheduler will break, and at the worst possible time.

If cron doesn't work, get an open-source or commercial solution. And who cares what tech the scheduler is written in? A scheduler's job is to run your programs and provide an API and a nice GUI if you desire.


Yeah exactly. I don't understand why they wanted the scheduler to be written in Python, since the scheduler should be decoupled from the jobs they are running anyway.


If your jobs are written in Python, there is nothing wrong with a Python-based scheduler. It can actually be quite convenient.


I think they have a task worker system they built back in 2014[1], so they needed something custom that works with it as well. Back then I don't think they had many options, but if they were to do it again now, I think either AWS Lambda or AWS Batch would serve this type of scheduled job very well.

[1]: https://engblog.nextdoor.com/nextdoor-taskworker-simple-effi...


NMS engineer at an enterprise telecom here. At my company, we've been switching over to Jenkins for job scheduling. Most of what used to be cronjobs have been fully Dockerized, and now we have Jenkins run periodic "builds" via pipelines. The pipelines themselves just run a docker image.

The single biggest advantage this has gotten us is centralized logging. I can check on the console output of any cronjob just by going to Jenkins and clicking on the job.

Moving from cron to Jenkins wasn't my idea, but the implementation is mine. I've built a few Docker images as bases. One is just the standard Python 3.6 Docker image. Another is the CentOS image equipped with Python 3.6 and Oracle RPMs for jobs that need database access. Another is the aforementioned image plus a number of Perl dependencies for jobs that need to call into our legacy Perl scripts.

For many scripts, I can use identical Dockerfiles. I just copy the directory containing the script, requirements.txt, Dockerfile, and Jenkinsfile, then I change out the script, edit the Jenkinsfile to reference the new script's name, and make any needed changes to the requirements.txt.


I've had good experiences using celery-beat to replace crond. It lets you use all the good parts of celery without much work. http://docs.celeryproject.org/en/latest/userguide/periodic-t...


Ditto, especially in combination with Python/Django, as used by Nextdoor. Ironically, they had already removed Celery from their stack a few years prior. https://engblog.nextdoor.com/nextdoor-taskworker-simple-effi...


Since nobody mentioned Anacron, I will. https://en.wikipedia.org/wiki/Anacron Deals with the server possibly being down for a period.


That was a difficult read for me. The blog post starts out with the four main problems with cron: their use wasn't scalable, editing the text file was hard, running their jobs was complex, and they didn't have any telemetry.

That's great, what does that have to do with cron?

As a result what I read was:

"We don't understand what cron does, nor do we understand how job scheduling is supposed to work, and we don't understand how to write 'service' based applications, so somebody said 'Just use cron' and we did some stuff and it didn't work how we liked, and we still haven't figured out really what is going on with schedulers so we wrote our own thing which works for us but we don't have any idea why something as broken as cron has persisted as the way to do something for longer than any of us has been alive."

I'm not sure that is the message they wanted to send. So let's look at their problems and their solution for a minute and figure out what is really going on here.

First problem was 'scalability', which is described in the blog post as "cron jobs pushed the machine to its limit". Their scalability solution was to write a program that put a layer between the starting of the jobs and the jobs themselves (it sends messages to SQS), and they used a new scheduler (APScheduler) to implement the core scheduler.

So what was the real win here? (Since they have recreated cron :-)) The win is that instead of forking and execing as cron does, allowing things like standard in and whatnot to be connected to the process, their version of cron sends a message to another system to actually start jobs. Guess what: if they wrote a bit of Python code that did nothing but send a message to SQS and exit, that would run pretty simply under cron. If they did it in C or C++, so they weren't loading an entire interpreter and its runtime every time, it would be lightning fast and add no load to the "cron server". This is basically being unaware of how cron works, and so not knowing the best way to use it.
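The shim being described really is tiny. A hedged sketch (job names are hypothetical, and the actual SQS call is stubbed out in a comment since it needs credentials and a queue URL):

```python
import json
import sys
import time

def build_job_message(job_name, args=()):
    """Build the small payload a cron-side shim would enqueue.

    The shim serializes this and exits immediately, so the "cron server"
    carries almost no load; the worker fleet does the actual work.
    """
    return json.dumps({
        "job": job_name,
        "args": list(args),
        "enqueued_at": int(time.time()),
    })

def main(argv):
    body = build_job_message(argv[0], argv[1:])
    # Hypothetical: hand the message to SQS (or any queue). With boto3 it
    # would be roughly:
    #   boto3.client("sqs").send_message(QueueUrl=QUEUE_URL, MessageBody=body)
    print(body)
    return 0

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Each crontab line then just invokes this with a job name, and cron itself never runs anything heavier than an enqueue.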

Their second beef was that crontabs are hard to edit reliably. Their solution was to write a giant CGI script in a web server that would read in the data structure used by their scheduler for jobs, show it as a web page, let people make changes, and then re-write the updated data structure for the scheduler. Guess what: the crontab is just the data structure for cron in text form, so you can edit it by hand if necessary. Or you can use crontab -e, which does syntax checking, or you could even write a giant CGI script that would read in the cron file, display it nicely on a web page, and then write it back out as a syntactically correct crontab when done.

Problem three was that their jobs were complex and failed often. This forced their poor opsen to log in, cut and paste a complex command line and restart the job. The real problem there is jobs are failing which is going to require someone to figure out why they failed. If you don't give a crap about why they failed the standard idiom is a program that forks the child to do the thing you want done and if you catch a signal from it that it has died you fork it again[1]. But really what is important here is that you have a configuration file under source code control that contains the default parameters for the jobs you are running so that starting them is just typing in the jobname if you need to restart or maybe overriding a parameter like a disk that is too full if that is why it failed. Again, nothing to do with cron and everything to do with writing code that runs on servers.
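The fork-and-restart idiom from the footnote can be sketched in a few lines; this is a naive version using subprocess rather than raw fork/signal handling, with a restart cap and backoff to avoid the loop-forever failure mode the footnote warns about:

```python
import subprocess
import time

def supervise(cmd, max_restarts=3, backoff=0.1):
    """Naive supervisor: run cmd, restart it when it dies nonzero.

    The restart cap plus backoff avoids the "loop forever restarting"
    failure mode. Real supervisors (daemontools, runit, systemd's
    Restart=) do this properly, with logging and rate limiting.
    """
    restarts = 0
    while True:
        rc = subprocess.call(cmd)
        if rc == 0:
            return rc                      # clean exit: job finished
        restarts += 1
        if restarts > max_restarts:
            return rc                      # give up, surface the failure
        time.sleep(backoff * restarts)     # linear backoff between attempts
```

But as the footnote says, the real fix is understanding why the job fails, not restarting it harder.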

And finally, there is no telemetry, no way to tell what is going on. Except that UNIX and Linux have like a zillion ways to get telemetry out. The original one is syslog, where jobs can send messages (which can even be collected over the network) about what they are up to, how they are feeling, and what, if anything, is going wrong. There are even different levels, like INFO or FATAL, which tell you which ones are important. Another tried and true technique is to dump a death rattle into /tmp for later collection by a mortician process (also scheduled by cron).
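Jobs can do this with a few lines of stdlib; a sketch using Python's syslog module (the ident string, facility choice, and key=value format are my own conventions, not anything standard):

```python
import syslog

def format_job_line(job_name, status, detail=""):
    # key=value pairs keep the line machine-parseable downstream
    return f"job={job_name} status={status} detail={detail!r}"

def emit_job_telemetry(job_name, status, detail="", level=None):
    """Send one structured, grep-able line per job event to syslog.

    syslogd (or rsyslog/syslog-ng) can then forward these over the
    network and route on severity, no custom telemetry system needed.
    """
    if level is None:
        level = syslog.LOG_INFO if status == "ok" else syslog.LOG_ERR
    line = format_job_line(job_name, status, detail)
    syslog.openlog("cronjobs", syslog.LOG_PID, syslog.LOG_CRON)
    syslog.syslog(level, line)
    syslog.closelog()
    return line
```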

At the end of the day, I can't see how cron had anything to do with their problems. Understanding the problem they were trying to solve in a general way would have allowed them to see many solutions, both with cron and with other tools that solve similar problems, and would have saved them from re-inventing the wheel yet again. It would also have given them the benefit of the bugs those other systems have already fixed over their millions (if not billions) of hours of collective run time.

[1] Yes this is lame and you get jobs that just loop forever restarting again and again which is why the real problem is the failing not the forking.


This rant seems out of character, but personally I appreciate it. People are always suggesting getting rid of cron, but I have always liked and trusted it. I prefer using tools that are old, battle-tested, and standard, but I do try to appreciate the advantages of new things. Cron seems to be a favorite target of NIH, for as long as I remember but especially in these days of "serverless", so it's easy to second-guess my appreciation for it. Thanks for clarifying where their problems really were. There seems to be a common temptation to think a new tool will solve your problems, when really many are in the irreducible specificity of your own code or systems.

EDIT: Oh by the way, Ruby folks struggling with cron might appreciate this: https://github.com/pjungwir/cron2english/


Cron2English looks very helpful for folks who struggle to read lines in a crontab.

I recognize that it is a sore spot for me when people re-invent the wheel when it seems clear they didn't have to and even clearer that the energy spent re-inventing the wheel would have been better (in terms of using the wheel) spent learning why the wheel is the way it is.

It is sort of the Chesterton's Fence of computer science. Don't tell me it's wrong; tell me how it is right for what it does and where it falls short for what you want it to do. Cron gets a bad rap here, as did sendmail for that matter. When taken to extremes you replace working systems like init with systems that are borked like systemd. No doubt cron is on the list of things to be absorbed by the systemd borg at some point. Yes, it makes me grumpy.


> cron ... absorbed by the systemd borg

This has already happened, right? https://wiki.archlinux.org/index.php/Systemd/Timers

Not a systemd fan, myself.


IMO cron is one of the parts that a good process management system should solve. With systemd timers I can do stuff like require other processes to run for a timer to run, activate a process either on a timer or on a socket, specify that it runs in a private chroot/netns/cgroup. These are all common things that I might usually want a process or service to do, and putting timed, manual, socket and path invocation of a service in the process management system makes sense to me.

I understand some of the systemd criticism, but I don't at all understand why you want something else to manage your processes than the process management system (which is basically always the init system)?
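For reference, the features described above map onto a pair of units roughly like this (paths, unit names, and the dependency on postgresql.service are all hypothetical):

```ini
# Hypothetical /etc/systemd/system/report.service
[Unit]
Description=Nightly report job
Requires=postgresql.service     # the timer's job only runs if this can run
After=postgresql.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/report.sh
PrivateTmp=yes                  # one of the sandboxing knobs (private /tmp)

# Hypothetical /etc/systemd/system/report.timer
[Unit]
Description=Run report.service nightly

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true                 # catch up on runs missed while the box was down

[Install]
WantedBy=timers.target
```

The same report.service can also be started manually (`systemctl start report`) or from socket/path activation, which is the point being made about putting all invocation styles in one place.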


> I understand some of the systemd criticism, but I don't at all understand why you want something else to manage your processes than the process management system (which is basically always the init system)?

Perhaps you don't understand one of the fundamental, philosophical criticisms, which is that it violates the Unix tenet of "do one thing and do it well".

Timing, and scheduling are difficult and (arguably, of course), deserve their own, separate utility.

Cron isn't a process manager, any more than an interactive shell is [1]. It just starts processes (and passes output to a local email client, though that's, understandably, brittle).

I wouldn't object to moving the process-starting functionality out of cron and leaving it with only the ability to tell init to "start job named XYZ", but I do object to just integrating its core competency of scheduling into that init.

[1] Arguably less than, considering modern shells' job control and signal facilities


This. If I had too many jobs running on underpowered single point of failure hardware, I wouldn't immediately think cron is my problem and rewrite it. Rethink your architecture, write meaningful log messages, collect stats, figure out why so many of your services fail that often.


> ways to get telemetry out, the original one is syslog, where jobs can send messages that get collected over the network

The re-invention of this one is one of my pet peeves, especially since the vast majority of the (legitimate) complaints about early implementations have been addressed in modern (last.. 10ish years?) implementations.

Syslog is lightweight, flexible, and plain text.

Back when the ELK stack first started gaining popularity, I'd get asked in interviews if I had experience with provisioning "high volume" logging. I was reasonably convinced that none of those people had ever seen (or would ever see) a high enough volume to be remarkable.

In hindsight, it was probably the same as thinking one has "big data" if it doesn't fit RAM on ones laptop (or even, more charitably, a server, even a decently large one).


> The re-invention of this one is one of my pet peeves, especially since the vast majority of the (legitimate) complaints about early implementations have been addressed in modern (last.. 10ish years?) implementations.

It's ironic considering that syslog is not standardized. Plug any two syslog implementations and they won't work together.

The two main competing protocols are defined in RFC 3164 and 5424, both named syslog, they are incompatible with each other and have their own peculiarities.

Logstash, the L in ELK, only parses a subset of RFC3164 messages with very specific formatting.


> It's ironic considering that syslog is not standardized.

I don't find any irony there, at all, since it's reflective of the history of Unix, and consistent with the OC's "like a zillion ways to get telemetry out". More to the point, it's fairer to say that different syslogd implementations were parallel evolution, rather than re-invention.

Now, it seems you're talking about the network protocol, whereas "syslog" the system call seems not to be too different across platforms, certainly not to the point of incompatibility. I call this out, not for pedantry, but to illustrate that, from the app's perspective, it's all the same.

> Plug any two syslog implementations and they won't work together.

Unless, perhaps, one of them understands (or can understand, with plugins) more than one protocol?

Regardless, the solution requiring the least engineering effort is to make sure one has compatible wheels everywhere, not re-inventing the wheel.

> The two main competing protocols are defined in RFC 3164 and 5424

I'm not sure that it's fair to call them competing when 5424 explicitly obsoletes 3164 (and the earlier one is "informational", while the later one is a "proposed standard").

Of course, that says nothing about the reality of the competing implementations, but if there are only two, and the older one is the most popular (which I believe it is), then perhaps there's no real competition.

> Logstash, the L in ELK, only parses a subset of RFC3164 messages with very specific formatting.

That would be an example of a (partial) syslogd re-invention. I have no idea if that input plugin is worth using (e.g. for performance reasons) rather than using something like a pipe plugin for plain text from a local syslogd. I suspect for the vast majority of use cases, it's totally irrelevant.


A log message is meant to be processed and stored somehow. It's a problem if the message cannot be parsed and is dropped.

The two RFCs are incompatible, and then every implementation is slightly different in its handling of milliseconds/microseconds/timezone, the presence of the hostname field, structured data support, etc.

I am calling out your first post's claim that syslog is lightweight, flexible, and plain text. syslog suffers from serious issues.


> It's a problem if the message cannot be parsed and is dropped.

Agreed, but you now seem to be implying that this just happens because syslog, rather than misusing the tool.

Earlier, you asserted "Plug any two syslog implementations and they won't work together" which I didn't bother to contradict, because it seemed like mere exaggeration. However, at this point, it seems like an extraordinary claim requiring extraordinary evidence.

I have, of course, routinely, in my professional career, spanning decades, successfully used different implementations with each other, usually because (more than) one implementation is embedded and can't be changed to match what's on the servers.

Exactly which syslogd implementations connected to each other over the network will fail to parse messages and drop them?

> The two RFC are incompatible, then every implementation is slightly different

I think you're reversing cause and effect, especially given both the relatively late date of the earlier RFC in the timeline of Unix history, as well as its enduring popularity over the later RFC.

> I am calling out your first post that syslog is lightweight, flexible and plain text.

I don't believe you've addressed, in any way, the assertion that it's lightweight.

Your challenge to its flexibility is, AFAICT, poor interoperability of only the network protocol, which you haven't backed up with evidence. Even then, that's a weak refutation of the flexibility of the mechanism overall.

Your challenge to its being plain text is, again, primarily based only on network protocol, and, even then, based on a widely-unimplemented RFC.

> syslog suffers from serious issues.

This may be responding to an implied strawman, that syslog is somehow perfect or ideal for all use cases, which, of course, I never said. In fact, being plain-text could be considered a "serious issue" for some telemetry.


It's a symptom of people developing and not understanding the environment they develop in/for.

I've used attitudes towards cron, specifically, as a personal metric for judging someone's skill and output with systems. What this team did is exactly the wrong thing to do.

I would rather use the tool with over 40 years of development work gone into it than whatever this is.


Exactly, they solved a bunch of problems that weren't really related to cron and overall, IMO, made the situation worse.

High resource usage so they moved the actual jobs off-server. Not cron related. Crontab is hard to edit so we'll create an arbitrary data structure that'll be both hard to edit and non-standard. Good job. Our jobs are failing so instead of better error handling we'll just run them again manually. Great. And apparently they've never heard of cron logging to syslog.


We schedule jobs on our Elixir cluster. Nice to not need anything on top of what you're already running


> Second, editing the plain text crontab is error prone

Doesn't every crontab in existence have a comment line giving the order of the time columns? I know I always rely on such.

Although the article didn't touch on it, this point reminded me of yesterday's discussion about manpages and command-line options. I think it's still the case today that `crontab -e` is the way to edit the crontab whilst `crontab -r` (the key right nextdoor, in case this part needed stressing!) removes it altogether.
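For anyone who hasn't seen it, the header comment in question typically looks something like this (the scheduled command is hypothetical):

```
# ┌──────── minute (0-59)
# │ ┌────── hour (0-23)
# │ │ ┌──── day of month (1-31)
# │ │ │ ┌── month (1-12)
# │ │ │ │ ┌ day of week (0-6, Sunday = 0)
# │ │ │ │ │
# * * * * *  command to run
30 2 * * 1  /usr/local/bin/weekly-report.sh
```

With that in the file, the "which column is which" problem mostly disappears.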


I got tired of writing the time specs manually, so now I keep this script in my PATH:

    #!/bin/sh
    # print a crontab entry for 2 minutes into the future

    date -d "+2 minutes" +"%M %H %d %m * /path/to/command"


Using Elixir we just scheduled work using a simple genserver, the initial naive version has no tracking of jobs done/failed/etc, but you can append those easily since they are just language constructs.


Kubernetes CronJobs resource would be my go to for this.
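A minimal manifest for reference (name and image hypothetical), showing the knobs that address the usual cron complaints: overlap prevention, retries, and missed-start handling:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"          # standard cron syntax
  concurrencyPolicy: Forbid      # prevent overlapping runs
  startingDeadlineSeconds: 300   # still start if the controller was briefly down
  jobTemplate:
    spec:
      backoffLimit: 2            # retries on failure
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: report
            image: registry.example.com/report:latest  # hypothetical image
            args: ["python", "report.py"]
```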


Good answer in 2018 but in 2016 the development of ScheduledJob (now CronJob) had just begun around November/December of that year so it was not an option for these guys.


The CPU comparisons are kinda funny to see. At first glance the low CPU usage looks good, but to me that's wasted resources. Good to see a more efficient system though. Hopefully those instances get allocated to different problem sets.


I use Apache Airflow with BashOperator for tons of stuff like this, simple web UI for logs/retries, supports dependencies between jobs and when tasks get more complex it’s Python and it supports extensions.


Apache Airflow is a great way to future proof your cron jobs. Existing cron jobs can be easily migrated and with that you'll get access to built in logging, distributed execution, connection management, web ui for simple monitoring and task retries and more.


A little tangential, but I recently created a small tool called "tsk" (pronounced same as task)

https://www.tsk.io

I'm calling it a "speed-dial for your APIs". I used to have a VPS that would run out of memory, so the easiest way to resolve this was to restart the server. But even that was annoying, so I created a button that would call the API to restart the box. Now I just have to click the button whenever I want to reboot.

Similar to cron, I'm currently building a feature in tsk to schedule tasks.


My scraping VPS keeps running out of memory too and I restart it every couple of days.

> so I created a button that would call the API to restart the box

Been thinking about building a 'button' that does this as well. Will check out TSK to see how well it addresses this. Sound idea for sure.


Thanks! Would love to hear your feedback on it!


sorry to say, but this whole premise seems just ... wrong.

why reinvent the wheel and not use cron or systemd? why reboot instead of searching for the offending script/process and fixing it?


Could be useful in a testing/build-as-you-go scenario.


We switched from Heroku Scheduler (very limited cron) to a system called Sidekiq Cron[1] (if we used Sidekiq Enterprise we would use the built-in scheduler). All Sidekiq Cron does is drop a job into the queue on a given interval. We also use HireFire to auto-scale our workers as necessary to keep things running.

[1]: https://github.com/ondrejbartas/sidekiq-cron

[2]: https://hirefire.io


We use Sidekiq's scheduled jobs[1] to replace the cron dependency in our codebase. Did you try it before opting for Sidekiq Cron?

[1]: https://github.com/mperham/sidekiq/wiki/Scheduled-Jobs


Where I work we have a similar product where we run all scheduled tasks on our Mesos cluster. Same idea as a Unix cron. You have a task that is executed every N minutes/hours/days; it runs on a box and does its thing.

It doesn't replace every cron job, but it is distributed, fault tolerant, and only breaks when the hadoop cluster backs up. The product mentioned here seems like a good solution for a team that needs to execute linux crons without much overhead.


A bit of a side-topic but, has anyone tried APS (Advanced Python Scheduler - https://apscheduler.readthedocs.io/en/latest/) in production?

I've been evaluating it as it seems to provide fault-tolerance, but IMO the documentation could be much better with more examples (e.g. mixing different triggers, configs,.etc)

Can anyone comment on it?


Job Scheduling.. We're using jenkins heavily at one of my clients sites for this (it's already used for builds, so scheduling jobs didn't seem that far afield..) The thing I'm missing, and would love to know if exists, is a calendar like view of all of the scheduled jobs, ala google calendar. Surely a product must exist that provides this view?



Here it's Quartz and its JDBC persistence baked into a Dropwizard application with a custom resource implementing job and trigger CRUD ops, and a Spring-based JobFactory for dependency injection stuff. Quartz has a few funny behaviors that could be better, but on the whole it's been working out nicely for scheduling across a multi-node stateless cluster.


For JVM (Java, Scala, Clojure, Groovy, etc), Quartz with JDBC (backed by DB instead of in-memory) works best for us.


It’s worth noting that SQS promises “at least once” delivery. In practice, this means some messages will be delivered multiple times.

After every job is successfully finished, it should be noted in your db. Every time a worker starts processing a job, it should check the db to see if this job has already been run.


The old fogeys among us would point out that this is an incomplete implementation of the batch queues that VMS has had for 3 or 4 decades. The new fogeys would point out that 60% of this functionality (the hard parts) is present in PowerShell and all the rest is a boring CRUD app.


Based on the CPU usage graph it looks like there's a lot more opportunity to downsize the scheduler hardware. I'd be curious to see how their TaskWorker cluster absorbs spikes in load when big cron jobs start.


Or they could have just used Hashicorp’s Nomad....

https://www.nomadproject.io/

It’s dead simple to set up and use.


Nomad was started about 39 months ago. The post in question is from 32 months ago when they wrote about a thing they had already built.

"We’ve been using Nextdoor Scheduler for over 18 months and we are extremely happy with it."

That means that what they built pre-dates Nomad by at least 11 months.


They basically built sidekiq enterprise. Way to go.


I'm the author of this blog post and the original developer of this scheduler system.

Glad to see a lot of interesting and insightful comments in this thread!

Some contexts here:

0. Like any piece of software, this scheduler system is not perfect for every company -- legacy code, # of engineers, skillsets of existing employees, engineering culture, tech stack, business,...

1. Every engineer in Nextdoor needs to do on-call rotation -- 50+ engineers back in 2014 (100+ now?), when the scheduler was built. It's important to run a system that all 50+ engineers have confidence to debug on a Friday night if things go south. There are some blackbox-type alternatives of cron, which may be great for a small team of backend experts to operate, but may not be a good fit for 50~100 engineers with very diverse skillsets & backend experience. But why every engineer needs to be on call? Well, that may deserve a new thread of discussions :)

2. In 2014, there were ~200 production jobs with very different computing characteristics (450+ now). Jobs are owned by different teams and different people with different expertise. Jobs are frequently (every few weeks?) added, removed, and updated. Outages most likely happen during code deployment. It’s important to run a system that works well during deployment (and rollback), e.g., always run the RIGHT version of code for hundreds of jobs.

3. The core scheduler system was pretty tiny, which could be easily understood by any engineer in the company. I remembered it took me a few days (probably less than a week) to finish most of the code. The hard part was productionization, e.g., carefully migrating 200 production jobs to the new system, logging, monitoring/alerting at job level, deployment & rollback process, oncall-related things (documentation/training)… With this simple system, we could easily enumerate various failure situations, so oncall engineers have confidence how to respond.

4. I’m not with Nextdoor any more. But turns out this scheduler system is still working well now: https://github.com/Nextdoor/ndscheduler/issues/33#issuecomme... I think the ROI is pretty good -- a 3-man-month project (from design to run all 200+ jobs on the new system) => running 4 years and counting & easy oncall situation for 50~100+ engineers.

5. Why not Jenkins or other open source alternatives? We investigated a bunch of alternatives. Jenkins is great. Back in 2014, Nextdoor used Jenkins for CI (not sure if it's still true now). We ruled out Jenkins (and Rundeck or the like) for operational reasons, e.g., challenges to integrate with existing code deployment process, operational complexity for 50+ engineers with different backgrounds/expertise... Open source alternatives? Well, in 2014, we couldn't find a good Python-based project. It's important to limit the number of languages/external technologies in the tech stack.


cloudwatch rules ??


Bingo! Those work great paired with a AWS Lambda function


Have they heard of crontab -e? It checks that your cron jobs have valid syntax. There are a crazy number of edge cases related to time. Dealing with time sucks; let the smart people who battle-tested and built cron deal with it.


What does that have to do with the problems that they dealt with? (randomly failing jobs, resource management, monitoring, etc)


I think the comment is in reference to the second problem listed, "editing the plaintext file is difficult."


Well, to be fair none of those things you list are due to a failure of cron itself either.


So, if not cron, what is better?


Aurora? Nomad?


We use Jenkins when we need something more complicated than cron can provide.


jenkins?


They really ought to have just learnt to use cron properly.


I was expecting everything to be 'reactive' (i.e., react to events in real time), so there would be no need for cron jobs.



