A startup founder called me last month after a 6-hour outage. Their payment processing had failed silently — no alerts, no monitoring, no logs that explained what happened. They found out because a customer emailed asking why their charge didn't go through. Six hours of lost revenue, and they couldn't even tell their customers what happened because they didn't know themselves.
This company had 12 engineers and $3M ARR. They'd never invested in monitoring because "we haven't had problems yet." More precisely, they hadn't had problems they could detect. That's different.
When to Start
The right time to invest in observability is before your first production incident, not after. Practically, that means:
5+ engineers or your first paying customer, whichever comes first. At this point, no single person has the full mental model of the system anymore. When something breaks, you need tools to answer "what changed?" and "what's affected?" instead of relying on someone's memory.
You don't need a full observability platform on day one. You need three things:
Layer 1: Error Tracking
An error tracking service like Sentry, Bugsnag, or Rollbar. This captures unhandled exceptions, gives you stack traces, and groups similar errors together. Setup takes an hour. The first time it catches a null pointer exception in production before any customer reports it, it pays for itself.
Configure alerting so errors above a threshold page someone. Not every error — that creates alert fatigue. But a spike in error rate, or a new error type that's never been seen before, should get human eyes within minutes.
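Here's a minimal sketch of what that looks like in a Node/TypeScript service using Sentry. The DSN, release variable, sampling rate, and the chargeCustomer helper are placeholders, not real project values:

```typescript
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder; comes from your Sentry project settings
  environment: process.env.NODE_ENV ?? "development",
  release: process.env.GIT_SHA,        // ties each error to a deploy
  tracesSampleRate: 0.1,               // sample 10% of transactions; tune for your traffic
});

// Hypothetical payment call, standing in for your own code.
async function chargeCustomer(orderId: string): Promise<void> {
  throw new Error(`card declined for ${orderId}`);
}

// Unhandled exceptions are captured automatically; errors you catch but
// still care about can be reported explicitly with extra context.
export async function handleCheckout(orderId: string) {
  try {
    await chargeCustomer(orderId);
  } catch (err) {
    Sentry.captureException(err, { tags: { subsystem: "payments" } });
    throw err;
  }
}
```

The spike and new-error-type alert rules themselves live in the Sentry project settings rather than in code.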
Layer 2: Infrastructure Metrics
CPU, memory, disk usage, network I/O, and response times for every service. If you're on AWS, CloudWatch covers the basics. GCP has Cloud Monitoring. For more sophisticated needs, reach for Datadog or Grafana Cloud.
The metrics you should alert on first: response time P95 exceeding your SLA, error rate exceeding baseline by 2x, disk usage above 80%, and memory usage trending toward limits. That's it. Four alerts. You can add more as you learn what your system's failure modes look like.
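As one concrete example, here's a sketch of the first of those alerts (P95 latency over SLA) defined with the AWS SDK for JavaScript. The load balancer dimension, SNS topic, and 500 ms threshold are placeholder assumptions; many teams manage the same alarm through Terraform or the console instead:

```typescript
import { CloudWatchClient, PutMetricAlarmCommand } from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({ region: "us-east-1" });

await cloudwatch.send(
  new PutMetricAlarmCommand({
    AlarmName: "api-p95-latency-over-sla",
    Namespace: "AWS/ApplicationELB",
    MetricName: "TargetResponseTime",
    Dimensions: [{ Name: "LoadBalancer", Value: "app/my-api-alb/abc123" }], // placeholder
    ExtendedStatistic: "p95",            // percentile, not average
    Period: 60,                          // evaluate every minute...
    EvaluationPeriods: 5,                // ...for five consecutive minutes
    Threshold: 0.5,                      // assumed 500 ms SLA, in seconds
    ComparisonOperator: "GreaterThanThreshold",
    TreatMissingData: "notBreaching",
    AlarmActions: ["arn:aws:sns:us-east-1:123456789012:oncall-pager"], // placeholder topic
  })
);
```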
Layer 3: Structured Logging
Not console.log("something went wrong"). Structured JSON logs with consistent fields: timestamp, service name, request ID, user ID, severity level, and a human-readable message. Every log line should be searchable and correlatable.
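A sketch of what that shape looks like with pino (any JSON logger works the same way); the service name and IDs are placeholders:

```typescript
import pino from "pino";

const logger = pino({
  base: { service: "checkout" },                 // stamped on every line
  timestamp: pino.stdTimeFunctions.isoTime,      // ISO timestamps instead of epoch ms
});

logger.info(
  { requestId: "req_7f3a", userId: "user_42" },
  "payment intent created"
);
// => {"level":30,"time":"2024-...","service":"checkout",
//     "requestId":"req_7f3a","userId":"user_42","msg":"payment intent created"}
```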
The critical pattern: correlation IDs. When a request enters your system, generate a unique ID and pass it through every service call, database query, and queue message. When something fails, you search for that ID and see the complete journey of that request across your entire system. Without this, debugging distributed systems is archaeological guesswork.
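A minimal sketch of that pattern in an Express service, using Node's AsyncLocalStorage so any code on the request path can read the ID without threading it through every function signature. The x-request-id header name is a common convention rather than a standard, and the inventory URL is hypothetical:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";
import express from "express";

const requestContext = new AsyncLocalStorage<{ requestId: string }>();
const app = express();

app.use((req, res, next) => {
  // Reuse an upstream ID if one arrived, otherwise mint a new one.
  const requestId = req.header("x-request-id") ?? randomUUID();
  res.setHeader("x-request-id", requestId);
  requestContext.run({ requestId }, next);
});

// Anywhere deeper in the call stack:
export function currentRequestId(): string | undefined {
  return requestContext.getStore()?.requestId;
}

// Pass the ID along on outbound calls so the next service logs the same one.
export async function callInventoryService(sku: string) {
  return fetch("https://inventory.internal/reserve", {   // placeholder URL
    method: "POST",
    headers: { "x-request-id": currentRequestId() ?? "" },
    body: JSON.stringify({ sku }),
  });
}
```

The same ID should also go into log lines (as the requestId field above) and onto queue messages, so the "complete journey" search works across asynchronous hops too.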
When to Level Up
Once you have the basics, there are clear triggers for each next investment:
Distributed tracing (Jaeger, Tempo, or built into your APM): when you have 10+ services and debugging cross-service performance issues becomes a weekly occurrence. Tracing shows you exactly where time is being spent in a request flow — which service is slow, which database query is the bottleneck.
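To make that concrete, here's a sketch of what instrumenting one suspect code path looks like with the OpenTelemetry API; the tracer name, span name, and db helper are assumed stand-ins for your own code:

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

// Hypothetical query helper, standing in for your own database client.
declare const db: { query(sql: string, params: unknown[]): Promise<unknown> };

const tracer = trace.getTracer("checkout-service"); // placeholder tracer name

export async function reserveInventory(orderId: string, sku: string) {
  return tracer.startActiveSpan("inventory.reserve", async (span) => {
    span.setAttribute("order.id", orderId);
    try {
      // This segment appears in the trace with its own duration, so a slow
      // query shows up as a wide bar instead of a mystery.
      return await db.query("SELECT reserve_stock($1, $2)", [sku, 1]);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```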
Custom business metrics: when you hit $1M+ ARR or when business stakeholders start asking "how many X happened today?" Build dashboards that track signups, conversions, feature usage, and revenue-impacting events alongside your infrastructure metrics. When the signup rate drops, you want to know immediately whether it's a product issue or a platform issue.
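One way to emit those is through the same pipeline as your infrastructure metrics. A sketch using the OpenTelemetry metrics API, with assumed meter, counter, and attribute names:

```typescript
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("checkout-service");          // placeholder name
const signupCounter = meter.createCounter("signups_total", {
  description: "Completed signups, by plan",
});

export function recordSignup(plan: "free" | "pro") {
  signupCounter.add(1, { plan });   // graph this next to error rate and latency
}
```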
SLOs and error budgets: when you have enough traffic that an availability percentage becomes a meaningful target rather than noise. Define what "available" means for your service, set a target (99.9% is reasonable for most B2B SaaS), and track your error budget. When the budget is burning fast, prioritize reliability work over feature work.
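The arithmetic is simple enough to sketch; the downtime figure below is illustrative:

```typescript
// Error-budget math for a 99.9% availability SLO over a 30-day window.
const sloTarget = 0.999;
const minutesInWindow = 30 * 24 * 60;                         // 43,200 minutes
const errorBudgetMinutes = minutesInWindow * (1 - sloTarget); // ≈ 43.2 minutes allowed

const downtimeSoFarMinutes = 12;                              // from your monitoring
const budgetRemaining = 1 - downtimeSoFarMinutes / errorBudgetMinutes;

console.log(`Error budget remaining: ${(budgetRemaining * 100).toFixed(0)}%`);
// ≈ 72% left; if this burns down before the month ends, reliability work wins.
```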
The Observability Anti-Patterns
Dashboard graveyards. Teams create 40 dashboards in the first week and look at none of them by month two. Start with 3 dashboards: system health overview, error rate detail, and business metrics. Add new ones only when you have a specific question they need to answer.
Alert fatigue. If your team gets paged 20 times a week, they'll start ignoring pages. Every alert should be actionable — meaning someone needs to do something right now. If an alert fires and the correct response is "wait and see," it shouldn't be an alert. Make it a log entry.
Monitoring only production. Your staging environment should have the same monitoring as production. If a performance regression is going to happen, catch it in staging, not at 2am in production.
Vendor lock-in on observability. This is ironic but common: the tool you chose to give you visibility into your systems becomes the system you're most locked into. Use OpenTelemetry for instrumentation — it's vendor-neutral and lets you switch backends without re-instrumenting your code.
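A sketch of what that looks like in a Node service, using the standard OpenTelemetry JS packages; the service name and OTLP endpoint are placeholders you'd set per environment:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "checkout-service",                   // placeholder
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,    // e.g. your collector or vendor endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()], // http, express, pg, and friends
});

sdk.start();
```

Switching from Jaeger to Tempo, Honeycomb, or a commercial APM then becomes a configuration change to the exporter endpoint, not a re-instrumentation project.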
Related: DevOps Fundamentals for Growing Teams, Post-Incident Reviews That Actually Work, Platform Engineering: Right-Sizing the Investment