This is the article for the IT leader who arrived at this page because things are broken right now. No theory. No frameworks. Triage actions.
The First 48 Hours
Stop the bleeding. These are not suggestions — these are emergency protocols.
Identify the top 5 tickets by business impact (not age). Your oldest ticket is not necessarily your most important ticket. Find the five requests that, if unresolved, will cost the business the most. These are your only priorities for the next 48 hours.
Freeze non-critical change requests. Everything that is not keeping production running or protecting revenue stops immediately. This includes that "quick enhancement" someone promised to a VP. It stops.
Establish a daily standup with a 15-minute hard cap. Not 30 minutes. Not "however long it takes." Fifteen minutes. Three questions per person: What did you resolve? What are you working on? What is blocking you? Anything else goes to a separate conversation.
Create a single shared view of active incidents. One board. One source of truth. If it is not on the board, it is not being worked on. If it is on the board, everyone can see its status.
The First 2 Weeks
You have stopped the hemorrhage. Now assess the damage.
Calculate your unplanned work ratio. For the next two weeks, track every hour. Categorize as planned (was on this week's plan) or unplanned (was not). The ratio tells you how much capacity you actually have for strategic work. If unplanned exceeds 40%, you are in structural crisis — not just a bad week.
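If you want to make the tracking mechanical, a minimal sketch follows. It assumes you log each block of work as a (category, hours) pair; the 40% threshold comes from the text, and the sample numbers are purely illustrative.

```python
# A minimal sketch, assuming hours are logged as (category, hours) pairs.
# The 40% threshold is from the text; the sample log below is made up.

def unplanned_ratio(entries: list[tuple[str, float]]) -> float:
    """entries: ("planned" | "unplanned", hours) pairs over two weeks."""
    total = sum(hours for _, hours in entries)
    unplanned = sum(hours for cat, hours in entries if cat == "unplanned")
    return unplanned / total if total else 0.0

week_log = [("planned", 6.0), ("unplanned", 2.0), ("unplanned", 4.5),
            ("planned", 3.0), ("unplanned", 1.5)]
ratio = unplanned_ratio(week_log)
print(f"Unplanned: {ratio:.0%}")   # Unplanned: 47%
if ratio > 0.40:
    print("Structural crisis, not just a bad week.")
```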
Identify your top 3 repeat-failure patterns. Which incidents keep recurring? Password resets? The same integration breaking? The same batch job failing? These repeat failures are consuming enormous capacity. Document them.
Audit after-hours coverage gaps. When was the last time a P1 incident hit outside business hours? What happened? How long before someone responded? If the answer involves "we got lucky" or "Bob happened to be awake," you have a coverage gap that is a ticking time bomb.
Document who knows what (tribal knowledge map). List every critical system. Write down who can fix it when it breaks. If any system has only one name next to it, you have a single point of failure. If that person leaves, retires, or gets sick, you lose that capability entirely.
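The map can be as simple as a dictionary. A minimal sketch, assuming a plain system-to-responders mapping; the system and staff names below are invented for illustration.

```python
# A minimal tribal knowledge map. Any system with exactly one name
# next to it is a single point of failure. All names are made up.

knowledge_map: dict[str, list[str]] = {
    "payroll-batch":   ["Dana"],
    "erp-integration": ["Dana", "Priya"],
    "vpn-gateway":     ["Marcus"],
    "billing-api":     ["Priya", "Sam", "Marcus"],
}

single_points_of_failure = {
    system: people[0]
    for system, people in knowledge_map.items()
    if len(people) == 1
}

for system, person in single_points_of_failure.items():
    print(f"SPOF: {system} depends entirely on {person}")
```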
The First 30 Days
You have stabilized. Now build the foundation for recovery.
Establish a triage protocol. Every incoming request gets classified: Critical (production down, revenue impact), Urgent (degraded but functional), Normal (enhancement, maintenance). Classification determines priority. No exceptions. No "everything is critical."
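To make "no exceptions" concrete, here is a minimal sketch of the classification rule, assuming intake captures three booleans. The three categories come from the text; the function and field names are illustrative. Note there is deliberately no fourth category and no override path.

```python
# A minimal sketch of the triage protocol. Classification determines
# priority; "everything is critical" is not expressible here.

from enum import Enum

class Severity(Enum):
    CRITICAL = 1   # production down or revenue impact
    URGENT = 2     # degraded but functional
    NORMAL = 3     # enhancement or maintenance

def classify(production_down: bool, revenue_impact: bool,
             degraded: bool) -> Severity:
    if production_down or revenue_impact:
        return Severity.CRITICAL
    if degraded:
        return Severity.URGENT
    return Severity.NORMAL

print(classify(production_down=False, revenue_impact=True, degraded=False))
# Severity.CRITICAL
```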
Implement basic ticket routing discipline. Requests go to the right person the first time. If your team spends more than 10% of their time re-routing misassigned tickets, your intake process is broken.
Create a minimum viable runbook for the top 10 recurring issues. Not perfect documentation. Not comprehensive SOPs. A one-page guide for each of the 10 most common incidents: what the user reports, what is actually happening, how to fix it, how to prevent it next time.
Identify one process that can be automated immediately. Not the most complex automation. The most frequent one. If password resets consume 200 tickets per month at 15 minutes each, that is 50 hours of capacity you can recover with a single automation investment.
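Picking the target is just frequency times handling time. A minimal sketch of the ranking, where only the password-reset figures come from the text; the other candidates and their numbers are invented.

```python
# Rank automation candidates by monthly hours consumed, not by how
# interesting the automation would be. Illustrative data except for
# the password-reset row, whose figures come from the text.

candidates = [
    # (process, tickets per month, minutes per ticket)
    ("password reset",     200, 15),
    ("vpn access request",  60, 20),
    ("batch job restart",   25, 45),
]

by_hours = sorted(
    ((name, n * mins / 60) for name, n, mins in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, hours in by_hours:
    print(f"{name}: {hours:.0f} hours/month")
# password reset: 50 hours/month  <- automate this one first
```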
When to Call for Help
These are the indicators that internal capacity cannot recover without external support (a quick self-check sketch follows the list):
- Unplanned work exceeds 50% of total effort. Your team is spending more time firefighting than executing. No amount of prioritization will fix this; you need additional capacity to absorb the operational noise.
- Ticket aging exceeds 7 days on average. Aging tickets compound: each unresolved ticket generates follow-up inquiries, escalations, and workarounds that consume additional capacity.
- More than 2 critical roles are single points of failure. If two key people leaving would cripple your operation, you are one resignation away from crisis.
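A minimal self-check against all three thresholds. The thresholds come from the list above; the function name and sample inputs are illustrative.

```python
# Any single indicator tripping suggests internal capacity alone
# will not recover. Thresholds from the text; sample values made up.

def needs_external_help(unplanned_ratio: float,
                        avg_ticket_age_days: float,
                        spof_roles: int) -> bool:
    return (unplanned_ratio > 0.50
            or avg_ticket_age_days > 7
            or spof_roles > 2)

print(needs_external_help(unplanned_ratio=0.55,
                          avg_ticket_age_days=5.0,
                          spof_roles=1))   # True: firefighting dominates
```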
What Not to Do
Do not hire in a panic. You will get the wrong people. Panic hiring optimizes for speed, not fit. You end up with contractors who need 3 months to ramp up and leave 6 months later, taking whatever knowledge they acquired with them.
Do not add more tools. Your problem is not a tool gap. It is a capacity gap. Adding a new monitoring tool, a new ticketing system, or a new automation platform adds complexity and a learning curve at exactly the moment your team has zero bandwidth for either.
Do not reorganize. Restructuring your team while they are drowning does not help them swim. It adds confusion and uncertainty to an already stressed environment. Stabilize first, then optimize.
Do not schedule a strategy offsite. Your team does not need a whiteboard session about the future. They need the present to stop being on fire. Strategy comes after stability.
*This field guide is based on the Structured Execution framework used across 200+ enterprise IT environments. The protocols are designed for immediate deployment without tooling changes or budget approval.*