The 2am Problem: Why Workflows Fail in Production
I once woke up at 2:47 AM to a notification that my resume-optimization workflow (Shadow Hound) had stopped working. A user had submitted a resume with a special character in their name that my scenario didn't expect. The API call failed, the scenario halted, and users were seeing error pages instead of optimized resumes.
That moment taught me that production reliability isn't optional—it's non-negotiable.
Here's what typically happens: A workflow runs perfectly 99% of the time. Then an edge case appears. Bad data arrives. An API times out. The workflow crashes silently, and nobody knows until customers complain. Or worse, until you wake up to it.
The difference between a scenario that breaks at 2am and one that gracefully handles problems is error handling. It's not flashy. It's not fun to implement. But it saves you from firefighting at 2am.
Error Handling Basics: The Safety Net Your Scenario Needs
In Make.com, every module can fail. A webhook call might timeout. An API might reject your request. A database might be unavailable. If you don't tell Make what to do when something fails, the entire scenario stops.
Error handlers are your first line of defense. They're simple: If this module fails, do that instead. Here's what I always add:
- API call failures. If I'm calling OpenAI, Google Sheets, or any external API, I wrap it with error handling. If it fails, I want to log the error, notify myself, and optionally retry.
- Missing data. If a critical field is empty, I should catch it early, not let it propagate through the scenario and corrupt downstream data.
- Database lookups. If I'm searching for a user in Airtable and they don't exist, that's not really an error—it's a valid case I should handle deliberately.
Here's my pattern: After every critical module, I add an error handler. The handler checks what went wrong, logs it to a spreadsheet or Slack, and either retries or gracefully continues.
For Shadow Hound, when the OpenAI API call fails, my error handler logs it to a dedicated "Failed Resumes" sheet and sends me a Slack notification. That way, I know immediately that something went wrong, and I can investigate. The user gets a helpful "Please try again in a moment" response instead of a blank error page.
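In Make this pattern is wired up visually with error-handler routes, but the logic is easy to sketch in code. The sketch below assumes injected placeholders: `call_api` stands in for the OpenAI module, `log_error` for the "Failed Resumes" sheet, and `notify` for the Slack message.

```python
import time

def call_with_fallback(call_api, log_error, notify, retries=2, delay=1.0):
    """Sketch of the error-handler pattern: retry, then log and alert.

    call_api, log_error, and notify are placeholders for the real
    modules (OpenAI call, "Failed Resumes" sheet, Slack notification).
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return {"success": True, "result": call_api()}
        except Exception as exc:  # in Make, this is the error-handler route
            last_error = exc
            if attempt < retries:
                time.sleep(delay * (attempt + 1))  # simple linear backoff
    log_error(str(last_error))                 # e.g. append a sheet row
    notify(f"Workflow failed: {last_error}")   # e.g. Slack alert
    # The user gets a friendly message, not a blank error page
    return {"success": False, "error": "Please try again in a moment"}
```

The key design choice: the failure path still returns a well-formed result, so the rest of the scenario (and the user) never sees a crash.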
Webhook Response Patterns: The Contract That Matters
When I build user-facing tools like Shadow Hound or Social Spark, they communicate with Make via webhooks. The frontend sends a request. Make processes it. Make sends a response. If the response isn't right, the user sees a broken experience.
The webhook response module is critical. It's the last thing that runs before your user sees a result. Here's what I always do:
- Always return HTTP 200. This tells the client that the request was received, even if something failed internally. I don't return 5xx errors because that indicates a server problem. Instead, I return 200 with a success flag in the body.
- Include a status field: {"success": true} or {"success": false}. The frontend checks this flag to know whether to show the result or an error message.
- Return useful data or error messages. If the workflow succeeded, include the result (optimized resume, generated post, etc.). If it failed, include a human-readable error message, not a technical stack trace.
- Add a request ID. I include a unique identifier that matches the log entry. If something goes wrong, I can trace the exact request through my logs.
For Social Spark, my webhook response looks like this:
{"success": true, "post": "Generated post content here...", "request_id": "uuid"}
or
{"success": false, "error": "Topic cannot be empty", "request_id": "uuid"}
This contract between Make and my frontend is crystal clear. The frontend always knows what to expect.
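The contract above can be sketched as a small response builder. This is an illustrative stand-in for Make's webhook response module, not its actual configuration; the `post` field name matches the Social Spark example.

```python
import json
import uuid

def webhook_response(result=None, error=None):
    """Build the response body: always a success flag, a payload or a
    human-readable error, and a request_id that matches the log entry."""
    body = {"success": error is None, "request_id": str(uuid.uuid4())}
    if error is None:
        body["post"] = result
    else:
        body["error"] = error
    return json.dumps(body)  # sent with HTTP 200 either way
```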
Testing Before Going Live: The Checklist That Saves You
Before activating a scenario for real users, I test it relentlessly. Not just happy paths. Edge cases. Failures. The scenarios that break at 2am are the ones where I skipped testing.
My testing checklist:
- Test with valid data—does it produce the expected result?
- Test with missing fields—does it fail gracefully or crash?
- Test with unexpected data types—what if someone puts text where I expect a number?
- Test with very large data—what if a resume is 50 pages?
- Test with special characters—quotes, accents, emojis. Unicode breaks things.
- Deliberately break an API call—does the error handler trigger?
- Deliberately timeout a request—does the scenario handle it?
- Check the webhook response format—is it valid JSON?
This takes time. For Shadow Hound, I probably tested 200+ times before it was production-ready. But that investment paid off. It's now been running for months without a hiccup.
Monitoring and Alerts: Knowing When Something Goes Wrong
You can't be awake all the time. So you need to know immediately when something breaks. I use Make's built-in execution history, but I also set up external monitoring.
I log everything that matters. Every API call, every error, every edge case. I log to a Google Sheet, and I review it daily. If something has failed more than twice, I investigate.
I set up Slack alerts. Critical errors send me a Slack message immediately. Not all errors—just the ones that indicate a real problem. Failed API calls, database errors, scenarios that crash. I check Slack in the morning and know instantly what needs attention.
I monitor operation counts. If a scenario suddenly uses 10x more operations than usual, something is wrong. Maybe it's looping infinitely, or the data changed. Monitoring operation counts helps me catch runaway scenarios before they blow my budget.
For all 7 of my tools, I have a "monitoring dashboard" spreadsheet that tracks daily operation usage, error counts, and any issues. It takes 5 minutes to set up and saves hours of debugging.
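The operation-count check boils down to comparing today's usage against a recent baseline. A minimal sketch, with thresholds that are my own rather than anything Make prescribes:

```python
def flag_runaway(today_ops, history, factor=10):
    """Flag a scenario whose daily operation count jumps past
    `factor` times its recent average (e.g. a runaway loop)."""
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(history) / len(history)
    return today_ops > factor * baseline
```

Feed it the daily counts from the monitoring spreadsheet and alert on any scenario it flags.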
Data Validation: Catching Problems Early
The best errors are the ones you prevent from happening in the first place. Data validation catches problems before they propagate.
Validate incoming data immediately. As soon as a webhook or email arrives, check that required fields are present and in the right format. If they're not, reject the request with a clear error message. Don't let bad data into your scenario.
Validate at intermediate steps. Before I pass data to an external API, I verify it. Does the email look valid? Is the number positive? Is the text not empty? These checks take seconds to add but prevent hours of debugging.
Use Make's built-in validation. When creating data in a module, Make can validate the data structure. I use this for Airtable records, database inserts, and webhook responses. Let Make catch the errors before they happen.
In Shadow Hound, I validate the resume file immediately. Is it a PDF? Is it not empty? Is it under the size limit? If validation fails, I send the user a clear message about what to fix, not a generic error.
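Those front-door checks can be sketched as a single validation function. The size limit and message wording are illustrative, and the `%PDF` magic-byte check is a common heuristic, not Shadow Hound's actual implementation.

```python
MAX_BYTES = 5 * 1024 * 1024  # illustrative size limit

def validate_resume(filename, data):
    """Reject bad uploads with a clear, user-facing message
    before the scenario does any work. Returns None if valid."""
    if not data:
        return "The file is empty. Please upload your resume again."
    if not filename.lower().endswith(".pdf"):
        return "Please upload a PDF file."
    if not data.startswith(b"%PDF"):
        return "That file doesn't look like a valid PDF."
    if len(data) > MAX_BYTES:
        return "The file is too large. Please keep it under 5 MB."
    return None  # valid: continue the scenario
```

Each branch returns the exact message the user sees, which is the point: validation failures are user communication, not internal errors.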
The Reliability Checklist: Before You Activate
I use this checklist before every scenario goes to production:
- Does every module that can fail have an error handler? Yes/No
- Have I tested with invalid data? Yes/No
- Have I tested with missing fields? Yes/No
- Are special characters handled? Yes/No
- Does the webhook response format match what the frontend expects? Yes/No
- Have I checked for runaway loops? Yes/No
- Is operation usage reasonable? Yes/No
- Are critical errors logged? Yes/No
- Have I set up alerts for failures? Yes/No
- Have I tested with realistic data volumes? Yes/No
If I can't check "Yes" for all of these, the scenario isn't ready. It will break at 2am, and I'll regret not taking 30 minutes to do this right.
Lessons Hard Won (And How to Avoid Them)
Lesson 1: Never assume clean data. Real data is messy. Names have special characters. Fields are sometimes empty. APIs return unexpected formats. Test with real data from your actual sources, not sanitized test data.
Lesson 2: Error handling is not optional. Every scenario will fail eventually. The question is whether it fails loudly (so you can fix it) or silently (so users complain). I choose loudly every time.
Lesson 3: Monitoring pays for itself. The 5 minutes I spend setting up alerts and logs saves me hours of firefighting later. Invest in observability.
Lesson 4: Test before activating. I know it's tedious. I know you want to just flip the switch and see it work. But 30 minutes of testing prevents days of debugging. Future you will be grateful.
Lesson 5: Graceful failures are better than crashes. If something goes wrong, don't let the whole workflow stop. Log it, notify yourself, and provide the user with a helpful error message. Graceful degradation is the sign of a professional system.
Build for the failure cases, not the happy path. Your users will never see the scenarios that work perfectly—they'll only notice the ones that break. Make it so that when things go wrong, they go wrong gracefully.