Production support
Five Things to Check Before Escalating a T24 Incident
The difference between a junior analyst and an experienced one is often not knowledge — it is ten minutes and a short checklist. Here are the five things worth doing before you escalate, so that the person you call does not spend the first five minutes of the conversation asking you questions you could have answered yourself.
There are two kinds of escalation call. The first kind begins with a clear description of what is happening, what has already been checked, what the logs say, and a working theory about what is wrong. The person being called can immediately do something useful. The call is short. The problem is resolved.
The second kind begins with “something seems wrong with T24.” The person being called spends the next fifteen minutes asking questions that could have been answered before the call was made. In some cases, the act of answering those questions reveals the solution. In all cases, it reveals that the escalation was, in hindsight, premature.
The five checks below will not catch every problem. They will catch enough of them that your escalations become the first kind, which is better for everyone — including your reputation as someone who knows what they are doing.
1. Actually read the log message
This sounds obvious. It is, apparently, not obvious enough, because the single most common question on any T24 escalation call is: “what does the log say?” And a meaningful proportion of the time, the answer is a pause followed by the sound of someone opening a log file for the first time.
The logs contain the answer more often than you might expect. Not always: sometimes the log message is genuinely cryptic, or points in the wrong direction, or is so verbose that the relevant line is buried somewhere in fourteen thousand lines of routine output. But often the log contains a specific error message, a specific record ID, a specific service that failed, and occasionally, if you are fortunate, something that reads almost like an explanation.
In TAFJ, the relevant logs are not always where you expect them. TemnLogger output, application server logs, and JVM logs live in different locations and write at different verbosity levels. Knowing which one to look at for which type of problem is a skill worth developing before the 2am incident rather than during it.
Before escalating: open the relevant log, find the error, read it. If you do not understand what it means, that is fine — bring it with you when you escalate. “The log says X and I am not sure what it means” is a useful starting point. “I have not checked the log” is not.
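If you want that habit to cost seconds rather than minutes, a small script helps. Below is a minimal sketch in Python that pulls the most recent error lines from a log file; the log path and the severity keywords are assumptions, so point them at whatever your TAFJ logging configuration actually writes.

```python
# Minimal sketch of the "read the log first" habit: pull the most recent
# error lines out of a log so you arrive at the escalation with the actual
# message in hand. Path and keywords are assumptions, not TAFJ defaults.
from pathlib import Path

LOG_FILE = Path("/opt/tafj/logs/TAFJ.log")  # hypothetical location
KEYWORDS = ("ERROR", "FATAL", "Exception")  # assumed severity markers

def recent_errors(path: Path, limit: int = 10) -> list[str]:
    """Return the last `limit` log lines containing an error keyword."""
    hits = [
        line.rstrip()
        for line in path.read_text(errors="replace").splitlines()
        if any(keyword in line for keyword in KEYWORDS)
    ]
    return hits[-limit:]

if __name__ == "__main__":
    for line in recent_errors(LOG_FILE):
        print(line)
```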
2. What changed in the last few hours
Most T24 incidents are caused by a change. Not always a deliberate one — sometimes it is a COB that ran differently, a scheduled job that ran at an unexpected time, or a batch process that completed and left something in a state it should not be in. But in a large number of cases, something happened recently, and the incident is a consequence of it.
The question worth asking is: what was different about this time compared to the last time this worked? A code deployment in the last few hours is the obvious candidate. A COB that completed earlier or later than expected is another. A configuration change, a scheduled task, a manual operation that someone performed and did not mention — all of these are worth checking before concluding that T24 has simply decided, of its own accord, to behave differently for no reason.
The slightly uncomfortable aspect of this check is that the change was sometimes made by someone who is now very keen for it not to be relevant. Ask anyway. “Has anything changed in the environment in the last few hours” is a reasonable question, and the answer will either help you find the problem or usefully rule out a category of cause.
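One way to make "has anything changed" concrete is a modification-time sweep over the directories where deployments and configuration land. A rough sketch follows; the directory is a placeholder, and the sweep will not catch changes made directly in the database, but it surfaces the most common candidate quickly.

```python
# A quick "what changed" sweep: list every file under a deployment
# directory modified in the last few hours. The directory is a
# placeholder; run it against wherever your releases actually land.
import time
from pathlib import Path

DEPLOY_DIR = Path("/opt/t24/deploy")  # hypothetical location
WINDOW_HOURS = 6

cutoff = time.time() - WINDOW_HOURS * 3600
for path in sorted(DEPLOY_DIR.rglob("*")):
    if path.is_file() and path.stat().st_mtime >= cutoff:
        stamp = time.strftime("%Y-%m-%d %H:%M",
                              time.localtime(path.stat().st_mtime))
        print(stamp, path)
```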
3. Is it one thing or everything
There is a significant difference between “T24 is down” and “one user cannot complete one transaction.” Both are described, with some frequency, as “T24 is not working.”
Before escalating, establish the blast radius. Is this affecting all users, or one user? All transactions, or one type? All accounts, or one account? All channels, or one channel? The answer to each of these questions narrows the problem substantially and points toward a different set of causes.
A problem that affects one user is almost certainly a user, configuration, or permission issue. A problem that affects one transaction type is probably something in the processing path for that transaction. A problem that affects everything simultaneously is a different and more serious category of incident and warrants a different response.
This check also protects you from the scenario where you escalate an apparent systemic failure and the first thing the person you called does is try it themselves and find that it works. This scenario is more common than anyone who has experienced it would like to admit.
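Part of establishing the blast radius can be automated. The sketch below tallies error lines by the record IDs they mention, so that "three accounts" versus "every account" becomes a number rather than an impression. Both the log path and the ID pattern are assumptions about your environment; adapt the regex to the identifiers that actually appear in your logs.

```python
# Blast-radius tally: group error lines by the record ID they mention,
# to separate "one account" from "everything". Path and ID pattern are
# assumptions -- adapt them to your environment.
import re
from collections import Counter
from pathlib import Path

LOG_FILE = Path("/opt/tafj/logs/TAFJ.log")  # hypothetical location
ID_PATTERN = re.compile(r"\b\d{10,16}\b")   # hypothetical account-ID shape

counts = Counter()
for line in LOG_FILE.read_text(errors="replace").splitlines():
    if "ERROR" in line:
        counts.update(ID_PATTERN.findall(line))

for record_id, n in counts.most_common():
    print(f"{record_id}: {n} error line(s)")
print(f"{len(counts)} distinct record(s) affected")
```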
4. Is the service actually running
This is the T24 equivalent of the question that launched a thousand helpdesk memes, and it is on the list for the same reason that question is famous: it works. A non-trivial percentage of T24 incidents are caused by a service that is not running, and a non-trivial percentage of those could have been resolved in the time it took to escalate them.
In TAFJ, the relevant services are the application server, the database connection pool, and any middleware components the affected process depends on. In TAFC, the T24 service itself. The symptoms of a stopped service can look like a variety of other problems, which is why it is worth checking explicitly rather than assuming that because some things are working, the relevant service must be running.
Check the service status. Check that the application server is accepting connections. If you are dealing with a TAFJ environment, check the MW42 session monitor to see whether active sessions exist. None of these checks take more than two minutes. All of them occasionally produce the answer immediately.
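The connection check does not even require logging into the server. Here is a minimal sketch that tests whether the application server is accepting TCP connections at all; the host and port are hypothetical placeholders for your environment.

```python
# The two-minute version of "is it actually running": test whether the
# application server accepts TCP connections. Host/port are placeholders.
import socket

APP_SERVER = ("t24-app01.example.internal", 8080)  # hypothetical host/port

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

host, port = APP_SERVER
state = "accepting connections" if port_open(host, port) \
        else "NOT accepting connections"
print(f"{host}:{port} is {state}")
```

Two minutes, and either you have the answer or you have ruled out a whole category of cause.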
5. Has this happened before
T24 environments develop patterns. The same problems recur, often at the same points in the processing cycle, triggered by the same conditions, with the same resolution. The people who have been supporting the environment longest know these patterns intuitively. The people who arrived recently do not — yet.
Before escalating, check whether the incident has a history. Previous tickets, incident records, runbooks, or simply asking someone who has been around longer: has anyone seen this before? If the answer is yes and there is a documented resolution, the escalation is probably unnecessary. If the answer is yes and the previous resolution was “it resolved itself after COB,” that is also useful information. If the answer is no and nobody has seen it before, that is valuable context for the escalation — it tells the person you are calling that this is genuinely new rather than a recurring known issue.
The absence of incident history does not mean the problem is serious. The presence of incident history does not mean it is not. But knowing which situation you are in before you escalate means the conversation starts from a better place.
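Checking for history is much faster when past tickets are searchable. The sketch below assumes your ticketing tool can export incidents to CSV; the file name and column names are hypothetical, but most tools can produce something in this shape.

```python
# "Has this happened before?" -- a crude but effective search over an
# exported ticket history for the error text pulled from the log.
# File name and column names are assumptions about your export format.
import csv
from pathlib import Path

TICKET_EXPORT = Path("incident_export.csv")  # hypothetical export file
SEARCH_TEXT = "OFS error"                    # paste the message from the log

with TICKET_EXPORT.open(newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        haystack = " ".join(str(v) for v in row.values()).lower()
        if SEARCH_TEXT.lower() in haystack:
            print(row.get("id", "?"), "-", row.get("summary", ""))
```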
What a good escalation actually sounds like
When you have worked through the five checks, an escalation sounds something like this:
“We have a batch job that is failing. The log shows an OFS error on a specific account; I can send you the exact message. It started after tonight's COB, which ran about forty minutes later than usual. It is affecting three accounts that we can identify so far, not the whole batch. The service is running. We have seen OFS errors on this batch before but not with this specific error code.”
Compare that to: “batch is failing.”
The first version gives the person being called something to work with immediately. The second version starts a conversation that will eventually arrive at the same information, but via a route that takes longer and involves more people being woken up than necessary.
None of the five checks require deep technical knowledge. They require the habit of looking before calling. That habit, reliably applied, is one of the most visible differences between a support analyst people are glad to hear from and one whose incoming call makes people brace slightly before answering.
