Most teams running Odoo in production know when the system is slow. What they rarely know is when it became slow, which requests caused it, and whether the problem affected business operations or just irritated a few users. That gap between feeling and evidence is what monitoring is supposed to close.
ERP observability is not the same as generic application monitoring. The signals that matter are tied to business operations, and the thresholds that matter are tied to business consequences.
Operational Monitoring Versus Business Monitoring
Operational monitoring answers whether the system is running. Business monitoring answers whether the system is working. Both matter, but teams usually invest heavily in the first and neglect the second.
Operational signals worth tracking:
- HTTP error rates and response status codes across Odoo endpoints.
- Odoo worker count, active versus idle, to detect pool saturation.
- PostgreSQL connection utilization.
- Memory, CPU, and disk usage per container.
Business signals worth tracking:
- Queue job failure rate and average processing time by channel.
- Cron job errors and time since last successful run.
- Order confirmation latency from creation to confirmed state.
- External integration failure rates for EDI, payments, and shipping APIs.
Operational signals tell you the system is degraded. Business signals tell you whether that degradation is affecting what the company cares about.
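As a rough illustration of the first business signal, the sketch below computes a failure rate per channel from a list of job records. The `(channel, state)` record shape, the state names `"done"` and `"failed"`, and the channel names are assumptions made for this example, not a description of any particular queue implementation.

```python
from collections import defaultdict


def failure_rate_by_channel(jobs):
    """Return {channel: failed / finished} for jobs that reached an outcome.

    `jobs` is an iterable of (channel, state) pairs. The state names
    "done" and "failed" are assumptions for this sketch.
    """
    finished = defaultdict(int)
    failed = defaultdict(int)
    for channel, state in jobs:
        if state not in ("done", "failed"):
            continue  # pending or running jobs have no outcome yet
        finished[channel] += 1
        if state == "failed":
            failed[channel] += 1
    return {channel: failed[channel] / finished[channel] for channel in finished}


# Hypothetical sample: channel names are illustrative only.
sample_jobs = [
    ("root.invoice", "done"),
    ("root.invoice", "failed"),
    ("root.invoice", "done"),
    ("root.invoice", "done"),
    ("root.edi", "done"),
    ("root.edi", "pending"),
]
rates = failure_rate_by_channel(sample_jobs)
```

Splitting by channel matters for the same reason splitting by endpoint does: a single failing integration can disappear inside a healthy global rate.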
Not every metric justifies attention. Averaging response time across all routes hides the worst offenders. A sale order confirmation that takes eight seconds is invisible when averaged alongside dozens of fast requests. Track the 95th and 99th percentile response times for the five to ten endpoints your business depends on most, and track them separately.
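To make the averaging problem concrete, here is a minimal sketch that compares a global mean with per-endpoint percentiles. The endpoint labels and durations are invented for illustration, and the percentile uses the simple nearest-rank method; a metrics backend may interpolate differently.

```python
import math
from collections import defaultdict
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical request log: (endpoint label, duration in seconds).
samples = (
    [("other", 0.2)] * 80
    + [("sale_order_confirm", 1.0)] * 18
    + [("sale_order_confirm", 8.0)] * 2
)

# Group durations by endpoint so each one is measured separately.
by_endpoint = defaultdict(list)
for endpoint, seconds in samples:
    by_endpoint[endpoint].append(seconds)

overall_mean = mean([seconds for _, seconds in samples])
confirm_p95 = percentile(by_endpoint["sale_order_confirm"], 95)
confirm_p99 = percentile(by_endpoint["sale_order_confirm"], 99)
other_p95 = percentile(by_endpoint["other"], 95)
```

With this data the global mean comes out at roughly half a second, while the 95th percentile for the confirmation endpoint is the full eight seconds. Only the per-endpoint view shows the problem.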
Worker saturation is the other high-value signal. In multi-worker mode, Odoo serves requests from a fixed pool of worker processes. When all workers are occupied with slow requests, new requests queue or fail. Saturation usually shows up in the metrics before users start complaining, and it is almost always caused by a small number of slow endpoints or a long-running scheduled action holding a connection.
PostgreSQL slow query logs are the most underused diagnostic tool in Odoo operations. Most ERP performance problems at the database layer are not caused by complex queries but by the same moderately expensive query running thousands of times per hour.
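One way to surface the "same query thousands of times" pattern is to strip literals from logged statements and aggregate by the resulting shape. The sketch below assumes single-line log entries of the form `duration: <ms> ms  statement: <sql>`, which is what PostgreSQL writes when `log_min_duration_statement` is set; the sample lines, the log prefix, and the simplified normalization rules are illustrative assumptions, and multi-line statements are not handled.

```python
import re
from collections import defaultdict

# Matches the duration and SQL text of a logged statement (single-line only).
LINE_RE = re.compile(r"duration: (?P<ms>[\d.]+) ms\s+statement: (?P<sql>.*)")


def fingerprint(sql):
    """Reduce a statement to its shape by replacing literals with '?'."""
    sql = re.sub(r"'(?:[^']|'')*'", "?", sql)                 # string literals
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "?", sql)              # numeric literals
    sql = re.sub(r"\(\s*\?(?:\s*,\s*\?)*\s*\)", "(?)", sql)   # IN (...) lists
    return re.sub(r"\s+", " ", sql).strip()


def summarize(lines):
    """Return (fingerprint, call count, total ms) rows, most expensive first."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a duration entry (checkpoints, connections, ...)
        entry = stats[fingerprint(match["sql"])]
        entry[0] += 1
        entry[1] += float(match["ms"])
    rows = [(shape, count, total) for shape, (count, total) in stats.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical log excerpt; the prefix format depends on log_line_prefix.
log_lines = [
    "2024-05-01 10:00:01 UTC [101] LOG:  duration: 120.5 ms  statement: SELECT id FROM sale_order WHERE partner_id = 42",
    "2024-05-01 10:00:02 UTC [102] LOG:  duration: 130.5 ms  statement: SELECT id FROM sale_order WHERE partner_id = 7",
    "2024-05-01 10:00:03 UTC [103] LOG:  duration: 200.0 ms  statement: SELECT name FROM res_partner WHERE id IN (1, 2, 3)",
    "2024-05-01 10:00:04 UTC [104] LOG:  checkpoint starting: time",
]
report = summarize(log_lines)
```

Sorting by total time rather than by single-call duration is the point: the query on `sale_order` ranks first here even though each individual call is faster than the `res_partner` lookup.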
Building Grafana Dashboards That Tell a Story
A useful dashboard answers a question rather than just displaying data. The most effective Odoo dashboards are organized around operational workflows, not system components.
Three dashboards cover most production needs. A system health overview with worker status, error rate, and queue job summary is what an on-call engineer opens first. A performance dashboard with endpoint response times, slow query counts, and worker queue depth is what an engineer uses during incident investigation. A business health dashboard with queue job failure rate by channel, cron job status, and integration success rate is what a technical lead reviews daily.
The most common mistake is combining all of these into a single panel-dense dashboard that nobody reads.
Alert Thresholds Grounded in Business Impact
Generic thresholds generate noise. An alert that fires because CPU reached sixty percent during an accounting close is not useful. An alert that fires because the queue job failure rate for outgoing invoices exceeded ten percent for fifteen consecutive minutes points to a real problem.
Thresholds worth setting:
- Queue job failure rate above a defined percentage sustained for several minutes, by channel.
- Any cron job that has not completed successfully within twice its expected interval.
- 95th percentile response time for sale order confirmation exceeding a business-defined threshold.
- PostgreSQL slow query count per minute rising above a normal-operations baseline.
- Worker pool occupancy above ninety percent for more than five minutes.
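Two of these rules, the sustained-threshold check and the cron staleness check, can be sketched in a few lines. The function names, the sample values, and the once-per-minute sampling are assumptions for illustration; in practice this logic usually lives in the alerting system's own rule language rather than in application code.

```python
from datetime import datetime, timedelta


def sustained_breach(samples, threshold, window):
    """True if every sample in the trailing `window` exceeds `threshold`.

    `samples` is a list of (timestamp, value) pairs sorted by time.
    Returns False when the history does not yet cover the full window.
    """
    if not samples:
        return False
    start = samples[-1][0] - window
    if samples[0][0] > start:
        return False  # not enough history to judge the whole window
    return all(value > threshold for ts, value in samples if ts >= start)


def cron_is_stale(last_success, expected_interval, now):
    """True if the last successful run is older than twice the expected interval."""
    return now - last_success > 2 * expected_interval


# Hypothetical worker occupancy, sampled once per minute.
t0 = datetime(2024, 5, 1, 10, 0)
occupancy = [0.50, 0.95, 0.96, 0.97, 0.95, 0.93, 0.94]
series = [(t0 + timedelta(minutes=i), value) for i, value in enumerate(occupancy)]

saturated = sustained_breach(series, threshold=0.90, window=timedelta(minutes=5))

now = t0 + timedelta(hours=3)
stale = cron_is_stale(t0, expected_interval=timedelta(hours=1), now=now)
```

Requiring the whole window to breach is what separates a sustained problem from a momentary spike. A single sample back under the threshold resets the condition.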
Closing the Gap Between Feeling and Evidence
The real cost of poor ERP observability is not time spent investigating incidents. It is decisions made without data. Teams add servers without knowing whether the problem is concurrency or query volume. Engineers optimize the wrong endpoints because averages hide outliers. Business stakeholders escalate based on anecdote because there are no dashboards to show them.
Good monitoring does not prevent all problems. It shortens the time between a problem appearing and the team understanding it well enough to act.