
How to monitor your existing integrations to prevent downtime and data loss

Your integrations work. But do you actually know when they don't?

You've built the integrations. The CRM syncs with the ERP, invoices generate automatically, orders flow into warehouse management. Everything looks fine — until a customer calls asking why their order hasn't been processed for 3 days.

The investigation reveals that an external API changed its response format on Tuesday evening. The sync failed silently. Nobody noticed because there were no alerts.

This is the reality for most companies: they invested in automation but not in monitoring that automation. Running integrations without monitoring is like driving a car without a dashboard: you're going the same speed, but you have no way to know the engine is overheating.

Why integrations fail silently

Unlike a web application that displays a visible error, system integrations fail in subtle ways:

  • Intermittent errors: an API times out roughly once every 50 requests. If the failed call is never retried, those records go missing at random.
  • Slow degradation: response times creep from 200ms to 3 seconds. Nobody notices until processing queues jam up.
  • Format changes: an optional field becomes required, or an enum gets a new value. The integration doesn't crash — but it processes incomplete data.
  • Authentication issues: a token expires, an SSL certificate doesn't renew. Sync stops without a sound.
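One way to keep format changes from slipping through is to validate each incoming record before processing it. The sketch below is a minimal illustration, not a full schema validator: the field names, the list of known statuses, and the function `find_format_problems` are all hypothetical and would need to match your own payloads.

```python
# Hypothetical order statuses this integration knows how to handle.
KNOWN_STATUSES = {"pending", "paid", "shipped", "cancelled"}
# Hypothetical fields the downstream system cannot work without.
REQUIRED_FIELDS = ("order_id", "status", "total")


def find_format_problems(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record is usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        # A field that is absent or null counts as missing.
        if record.get(field) is None:
            problems.append(f"missing required field: {field}")
    status = record.get("status")
    # A new enum value is reported instead of being processed silently.
    if status is not None and status not in KNOWN_STATUSES:
        problems.append(f"unknown status value: {status!r}")
    return problems


if __name__ == "__main__":
    print(find_format_problems({"order_id": 1, "status": "refunded"}))
```

Records with a non-empty problem list can be logged and set aside for review rather than synced with incomplete data.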

A 2025 Gartner study estimates that unplanned downtime costs mid-market companies an average of EUR 4,600 per hour. But the real cost of silently failing integrations is much higher — because the problem isn't visible downtime, it's missing data you discover weeks later.

The 4 levels of integration monitoring

Level 1: Health checks (availability monitoring)

The simplest and most immediately deployable. An automated job periodically verifies that each external endpoint responds correctly.

Practical implementation:

  • Send a test request to each external API every 5 minutes
  • Check not just the status code (200) but also the response structure — does it return the expected fields?
  • Alert via Slack or email if a check fails 3 consecutive times

Cost: EUR 0 if you use UptimeRobot (free plan, 50 monitors) or EUR 50-100/month for Better Uptime with incident management.

Watch out: a health check that only verifies the HTTP status is insufficient. We've seen cases where the API returned 200 OK, but the body contained an error message in plain text. Always validate the response structure, not just the code.
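The checks described above could be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions: the endpoint is assumed to return a JSON object, the function names and the `expected_fields` parameter are made up for this example, and sending the actual Slack or email alert is left out.

```python
import json
import urllib.request
from collections import defaultdict

FAILURE_THRESHOLD = 3  # alert after 3 consecutive failed checks
consecutive_failures: dict[str, int] = defaultdict(int)


def body_is_healthy(status: int, body: str, expected_fields: set[str]) -> bool:
    """Pass only if the status is 200 AND the body is a JSON object with the expected fields."""
    if status != 200:
        return False
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False  # e.g. 200 OK with a plain-text error message in the body
    return isinstance(data, dict) and expected_fields.issubset(data)


def record_result(name: str, healthy: bool) -> bool:
    """Update the failure streak; return True exactly when the streak reaches the threshold."""
    if healthy:
        consecutive_failures[name] = 0
        return False
    consecutive_failures[name] += 1
    return consecutive_failures[name] == FAILURE_THRESHOLD


def check_endpoint(name: str, url: str, expected_fields: set[str]) -> bool:
    """Run one check against the endpoint; return True if an alert should fire now."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", "replace")
            healthy = body_is_healthy(resp.status, body, expected_fields)
    except OSError:  # connection errors, HTTP errors, and timeouts
        healthy = False
    return record_result(name, healthy)
```

A scheduler (cron, for example) would call `check_endpoint` for each API every 5 minutes. Comparing the streak with `==` rather than `>=` means the alert fires once when the third failure occurs, not on every failed check after it.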

Level 2: Centralized structured logging

Every integration should write structured logs (JSON) to a centralized location. Not text files scattered across different servers — one single place where you can search and filter.

What to log per request:

  • Timestamp, integration_name, direction (inbound/outbound)
  • Request payload (sanitized — no personal data or credentials)
  • Response status, response time, error message (if any)
  • Correlation ID to link requests across systems
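As a rough sketch, a log entry with the fields listed above might be built like this. The key names in the output, the `SENSITIVE_KEYS` list, and the helper names are assumptions for illustration; a real sanitizer should be driven by your own data model.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical list of keys that must never reach the logs.
SENSITIVE_KEYS = {"password", "token", "api_key", "email", "phone"}


def sanitize(payload):
    """Recursively replace the values of sensitive keys with a placeholder."""
    if isinstance(payload, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else sanitize(value)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [sanitize(item) for item in payload]
    return payload


def build_log_entry(integration_name, direction, payload, status,
                    response_time_ms, correlation_id=None, error=None) -> str:
    """Return one JSON log line for a single request."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "integration_name": integration_name,
        "direction": direction,  # "inbound" or "outbound"
        "request_payload": sanitize(payload),
        "response_status": status,
        "response_time_ms": response_time_ms,
        "error_message": error,
        # Reuse the caller's ID when one exists so requests link up across systems.
        "correlation_id": correlation_id or str(uuid.uuid4()),
    }
    return json.dumps(entry)


if __name__ == "__main__":
    print(build_log_entry("crm_to_erp", "outbound",
                          {"order_id": 7, "token": "secret"}, 200, 184))
```

Writing one JSON object per line keeps the output easy to ship to whichever central log store you choose.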

Recommended tools:

  • Grafana + Loki: open-source, self-hosted, costs only infrastructure (~EUR 30/month on a VPS)
  • Datadog: managed, Pro plan from EUR 15/host/month — excellent but can get expensive fast
  • Axiom: a modern alternative with ingest-based pricing, from EUR 0 (generous free plan)

In a NEXVA SYSTEM project for a distributor with 12 active integrations, implementing centralized logging reduced average problem diagnosis time from 4 hours to 15 minutes. The gain wasn't just technical — the support team stopped spending half their day investigating issues that now take a few clicks to identify from the dashboard.

Level 3: Business metrics (not just technical metrics)

Technical monitoring tells you the API responds. Business metrics tell you the integration is actually working.

Concrete examples:

  • Orders synced per hour: if the average is 45 and you suddenly see 12, you have a problem — even if the API returns 200
  • Stock discrepancy: compare ERP inventory with e-commerce inventory every hour. Differences above 2% = alert
  • Invoices generated vs. orders completed: the ratio should be 1:1. Any deviation signals a pipeline problem
  • Average sync duration: if a sync that used to take 30 seconds now takes 5 minutes, that's a degradation signal
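The examples above can be turned into simple checks. The sketch below makes two assumptions that the examples leave open: stock discrepancy is measured as the share of SKUs whose quantities differ, and an hourly order count below half the average counts as a problem. Function and parameter names are illustrative.

```python
def stock_discrepancy_pct(erp_stock: dict[str, int], shop_stock: dict[str, int]) -> float:
    """Percentage of SKUs whose quantity differs between the ERP and the shop."""
    skus = set(erp_stock) | set(shop_stock)
    if not skus:
        return 0.0
    # A SKU present in only one system also counts as a mismatch.
    mismatched = sum(1 for sku in skus if erp_stock.get(sku) != shop_stock.get(sku))
    return 100.0 * mismatched / len(skus)


def business_alerts(orders_last_hour: int, hourly_average: float,
                    invoices_generated: int, orders_completed: int,
                    discrepancy_pct: float) -> list[str]:
    """Return one message per business metric that looks wrong."""
    alerts = []
    # Assumed rule: fewer than half the usual orders is suspicious.
    if hourly_average > 0 and orders_last_hour < 0.5 * hourly_average:
        alerts.append(f"orders synced: {orders_last_hour} vs. average {hourly_average}")
    # Invoices and completed orders should match 1:1.
    if invoices_generated != orders_completed:
        alerts.append(f"invoices {invoices_generated} != completed orders {orders_completed}")
    if discrepancy_pct > 2.0:
        alerts.append(f"stock discrepancy {discrepancy_pct:.1f}% exceeds 2%")
    return alerts


if __name__ == "__main__":
    pct = stock_discrepancy_pct({"A": 5, "B": 3}, {"A": 5, "B": 2})
    for message in business_alerts(12, 45, 38, 40, pct):
        print(message)
```

None of these checks look at HTTP status codes, which is the point: they can fire even while every API call returns 200.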

Recommended alerting thresholds:

| Metric | Yellow alert | Red alert |
|---|---|---|
| Error rate | > 5% over a 15-minute window | > 15% over a 5-minute window |
| API latency | > 2x normal average | > 5x normal average |
| Data volume | < 50% of daily average | < 20% of daily average |
| Failed syncs | 3 consecutive | 5 consecutive |

These thresholds are starting points — adjust them based on each integration's specifics. An integration that syncs inventory 10 times a day has different thresholds than one processing payments in real time.
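To make the table concrete, here is one possible encoding of those starting thresholds. It is only a sketch: the function names are invented for this example, rates are expressed as fractions (0.05 = 5%), and in practice the numbers would live in per-integration configuration rather than in code.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    YELLOW = "yellow"
    RED = "red"


def error_rate_severity(rate_5min: float, rate_15min: float) -> Severity:
    """Red: > 15% over the last 5 minutes. Yellow: > 5% over the last 15 minutes."""
    if rate_5min > 0.15:
        return Severity.RED
    if rate_15min > 0.05:
        return Severity.YELLOW
    return Severity.OK


def latency_severity(current_ms: float, normal_avg_ms: float) -> Severity:
    """Red: > 5x the normal average. Yellow: > 2x the normal average."""
    if current_ms > 5 * normal_avg_ms:
        return Severity.RED
    if current_ms > 2 * normal_avg_ms:
        return Severity.YELLOW
    return Severity.OK


def volume_severity(volume: float, daily_average: float) -> Severity:
    """Red: < 20% of the daily average. Yellow: < 50% of the daily average."""
    if volume < 0.2 * daily_average:
        return Severity.RED
    if volume < 0.5 * daily_average:
        return Severity.YELLOW
    return Severity.OK


def failed_sync_severity(consecutive_failures: int) -> Severity:
    """Red: 5 consecutive failed syncs. Yellow: 3 consecutive failed syncs."""
    if consecutive_failures >= 5:
        return Severity.RED
    if consecutive_failures >= 3:
        return Severity.YELLOW
    return Severity.OK


if __name__ == "__main__":
    # The latency creep from 200ms to 3 seconds mentioned earlier is 15x normal.
    print(latency_severity(3000, 200))
```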

Level 4: Smart alerting (without alert fatigue)

The most common reason monitoring fails isn't a lack of alerts — it's too many alerts. The team gets 30 notifications a day, ignores all of them, and misses the critical one.

Rules for effective alerting:

  • Gradual escalation: first alert in Slack. If unresolved in 30 minutes, SMS. If unresolved in 60 minutes, phone call.
  • Grouping: bundle related alerts from the same integration into a single incident
  • Maintenance suppression: temporarily disable alerts during deploys or planned maintenance
  • Attached runbook: every alert includes a link to troubleshooting documentation
  • Monthly review: analyze last month's alerts. If an alert fired 20 times without requiring action, either adjust the threshold or remove it

A simple rule: if an alert doesn't require immediate human action, it shouldn't be an alert. Turn it into a log entry or a weekly report.
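Three of the rules above (escalation, grouping, and maintenance suppression) are simple enough to sketch in a few functions. The channel names, the alert dictionary shape, and the function names are assumptions for illustration; actual delivery through Slack, SMS, or phone would go through whatever incident tool you use.

```python
from datetime import datetime, timedelta


def escalation_channel(opened_at: datetime, now: datetime, resolved: bool = False):
    """Pick the notification channel based on how long the incident has been open."""
    if resolved:
        return None
    age = now - opened_at
    if age >= timedelta(minutes=60):
        return "phone"
    if age >= timedelta(minutes=30):
        return "sms"
    return "slack"


def group_alerts(alerts: list[dict]) -> dict[str, list[str]]:
    """Bundle alert messages into one incident per integration."""
    incidents: dict[str, list[str]] = {}
    for alert in alerts:
        incidents.setdefault(alert["integration"], []).append(alert["message"])
    return incidents


def is_suppressed(now: datetime, maintenance_windows: list[tuple[datetime, datetime]]) -> bool:
    """True while `now` falls inside any planned maintenance or deploy window."""
    return any(start <= now < end for start, end in maintenance_windows)


if __name__ == "__main__":
    opened = datetime(2025, 1, 1, 12, 0)
    print(escalation_channel(opened, opened + timedelta(minutes=45)))
```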

Implementation checklist (in priority order)

1. Week 1: Set up health checks for all external APIs (1-2 hours of work)

2. Week 2: Implement centralized structured logging (2-3 days of development)

3. Weeks 3-4: Define and implement business metrics for your top 3 critical integrations

4. Month 2: Configure alerting with escalation and runbooks

5. Month 3: Centralized dashboard with real-time visualization of all integration states

Total estimated budget: EUR 3,000-8,000 for full implementation (all 4 levels), depending on the number and complexity of integrations. Most of the cost comes from levels 2 and 3 — health checks take a few hours to set up, but centralized logging and business metrics require real development work.

The practical takeaway

Integration monitoring isn't a luxury — it's insurance. The implementation cost is 10-20x lower than the cost of a major undetected outage. And the difference between a company that responds to problems in 5 minutes and one that discovers them after 3 days is massive — both financially and in customer relationships.

If you have critical integrations running without monitoring, book a free consultation. At NEXVA SYSTEM, every monitoring project starts with an audit of existing integrations — because you can't monitor what you don't understand.
