What I haven't figured out
Some technical things I feel like I've never really figured out and am still trying to:
How much should applications fail-fast and how much should they tolerate failure?
Failing fast feels right and I've implemented it in a lot of places. Using envsafe or something similar is a necessity on any project I work on: if an application isn't properly configured, it should fail at startup instead of limping along.
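A minimal sketch of what that looks like with envsafe (the specific variable names here are just illustrative):

```ts
import { envsafe, str, url, port } from 'envsafe';

// Validate all required configuration at startup. If anything is missing
// or malformed, envsafe throws immediately with a list of the offending
// variables, so the app fails fast instead of limping along.
export const env = envsafe({
  NODE_ENV: str({ choices: ['development', 'test', 'production'] }),
  PORT: port({ devDefault: 3000 }),
  DATABASE_URL: url(),
});
```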
But should applications tolerate failed database queries in an elegant way? What about failed external services?
I think one clear line is that an application shouldn't allow internal inconsistency. For example, if you have some function that's being called with an incorrect argument type, you update the callers instead of making the function more flexible. That probably stops being true as a company grows, because eventually you can't tell a whole team to just update all their code when an API changes.
But the line keeps moving: in particular, the last two years have shown me that it's useful to have a system that can fail partially, that every single external service will fail at some point, and that you should have a plan for those failures, whether that's tolerating them or retrying (see the sketch below).
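Roughly what I mean, with hypothetical names (`withRetry` and `fetchRecommendations` aren't from any particular codebase): retries for transient failures, and graceful degradation where a feature isn't worth failing the whole request over.

```ts
// Hypothetical flaky external call; stands in for any third-party service.
async function fetchRecommendations(userId: string): Promise<string[]> {
  const res = await fetch(`https://api.example.com/recs/${userId}`);
  if (!res.ok) throw new Error(`upstream returned ${res.status}`);
  return res.json();
}

// Retry a flaky call a few times with exponential backoff, then give up.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off between attempts: 200ms, 400ms, 800ms...
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}

// Tolerating failure instead: a missing recommendations widget shouldn't
// take the whole page down with it.
async function getRecommendations(userId: string): Promise<string[]> {
  try {
    return await withRetry(() => fetchRecommendations(userId));
  } catch {
    return []; // degrade to an empty list rather than failing the request
  }
}
```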
How should logs work?
Every application I've worked on eventually generates several 'flavors' of log messages out of stdout and stderr, and the logs stop being useful because they're filled with 'junk' like request logs.
I've tried structured JSON logs with pino, tried tslog, Betterstack, Axiom, and never got it. We've never had a team member who really got value out of logs. I've never really gotten value out of logs. I often wonder whether servers should emit logs at all, or whether we should just do telemetry and metrics instead.
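For concreteness, this is roughly the shape of the structured-logging setup I keep trying. The `kind` field is just a convention for separating the flavors, not anything pino prescribes:

```ts
import pino from 'pino';

const logger = pino({ level: process.env.LOG_LEVEL ?? 'info' });

// Tag each flavor of log with a child logger, so downstream tools can at
// least filter the request 'junk' away from the application logs.
const requestLog = logger.child({ kind: 'request' });
const appLog = logger.child({ kind: 'app' });

requestLog.debug({ method: 'GET', path: '/health' }, 'request completed');
appLog.warn({ queue: 'emails', depth: 5000 }, 'queue is backing up');
```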