Production support
The First 30 Days After a TAFJ Go-Live
The project team has declared success and moved on to their next engagement. The environment is live. The handover document runs to forty-seven pages and answers several questions nobody asked while leaving others untouched. This is what the operations team actually encounters in the first month — and what to do about it.
The go-live moment has a particular quality to it. There is a period of heightened attention — everyone watching, tickets being raised and resolved with unusual speed, the project team still present and still invested. Then, gradually, the scaffolding comes down. The project manager has other projects. The technical leads have other engagements. The hypercare period, which was defined as four weeks but quietly became two, concludes. And the operations team is left with a live TAFJ environment, a set of runbooks of variable quality, and the knowledge that the next incident belongs to them entirely.
What follows is a reasonably accurate account of what the first thirty days tend to produce, based on the observable patterns that repeat across TAFJ go-lives with the kind of consistency that suggests they are structural features rather than bad luck.
Week one: the things that break immediately
The first week after go-live has a predictable character. The production environment is under real load for the first time. Real users are doing things that the test team did not do, with real data in quantities that the test environment did not have, at times of day that the test schedule did not cover. The gaps between what was tested and what production actually looks like begin to become apparent.
The JVM behaves differently under real load. Heap sizing that was adequate in a test environment running at a fraction of production volume often turns out to be optimistic once the full transaction load arrives. Garbage collection starts running more frequently. Response times begin to drift. In some cases the application server slows to a point where it becomes a noticeable problem before anyone has identified the cause. This is not a failure of the migration — it is a consequence of the test environment not being able to fully replicate production load. It does, however, require someone who knows how to read JVM metrics and adjust heap configuration, which is a skill that the TAFC operations team may not have needed before now.
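It helps to have garbage collection logging switched on before the first slowdown rather than after it. A minimal sketch, assuming a Java 11 or later JVM behind the TAFJ application server — the heap sizes and file paths here are placeholders, not recommendations, and on Java 8 the equivalent is -XX:+PrintGCDetails with -Xloggc:

```bash
# Hypothetical JVM options for the TAFJ application server process.
# Heap sizes are illustrative -- size them against observed production usage.
# On Java 8, use -XX:+PrintGCDetails -Xloggc:<file> instead of -Xlog.
export JAVA_OPTS="-Xms4g -Xmx4g \
  -Xlog:gc*:file=/var/log/tafj/gc.log:time,uptime:filecount=5,filesize=20m"
```

With rotated GC logs already on disk, the question "is it the garbage collector?" can be answered in minutes rather than reconstructed after the fact.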
The monitoring shows the wrong things. In TAFC, the operations team knew where to look. In TAFJ, the monitoring is different — Grafana dashboards, TemnLogger output, JVM metrics — and the dashboards that exist were configured by the project team to show what the project team found useful during testing. This is not the same as what the operations team needs to see to run the environment day to day. Alerts that should fire do not. Metrics that matter are not on the dashboard. The operations team is watching the environment through a window that was positioned for someone else's view.
Nobody can find the log files. This sounds like a small thing. In practice, when something goes wrong at 11pm and the first-line analyst is trying to triage, the difference between knowing where the TAFJ logs are and not knowing is the difference between a twenty-minute diagnosis and a two-hour one. TemnLogger output, application server logs, and JVM garbage collection logs are in different locations from the TAFC logs they replaced. The runbook says where they are. The runbook may not be the first thing the analyst reaches for.
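One cheap mitigation is a small wrapper script on each application server that takes the analyst straight to the right files, so the runbook does not need to be the first thing they reach for. A sketch, with every path a placeholder for wherever the installation actually writes its logs:

```bash
#!/usr/bin/env bash
# taillogs -- follow the logs a first-line analyst needs at 11pm.
# All paths are hypothetical; substitute the locations from your own install.
TAFJ_LOG=/opt/tafj/logs/tafj.log              # TemnLogger output
APPSERVER_LOG=/opt/appserver/logs/server.log  # application server log
GC_LOG=/var/log/tafj/gc.log                   # JVM garbage collection log

tail -F "$TAFJ_LOG" "$APPSERVER_LOG" "$GC_LOG"
```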
Weeks two and three: the things that break quietly
The second and third weeks tend to surface a different category of problem — not the immediate failures that announce themselves loudly, but the quieter ones that produce wrong results without producing errors. These are, in some ways, harder to deal with.
Batch jobs that complete successfully and produce incorrect output. The TAFJ selection layer processes data differently from TAFC in certain cases — particularly where batch routines use derived DICT fields or date-based filters. A routine that compiled, tested, and ran without errors in the test environment may produce a subtly different record set in production, where the data combinations are more varied. The job finishes. The output file exists. The downstream system receives it. Nobody notices until someone compares a report and finds that the numbers do not match.
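A crude but effective countermeasure is to reconcile record counts against a recent baseline as part of the job wrapper, so that a quietly wrong extract fails loudly. A sketch, with the file names and the ten per cent tolerance invented for illustration:

```bash
#!/usr/bin/env bash
# Compare today's extract row count against yesterday's; flag large swings.
# OUTPUT_FILE, BASELINE_FILE and the 10% tolerance are all placeholders.
OUTPUT_FILE=/data/extracts/positions_$(date +%Y%m%d).csv
BASELINE_FILE=/data/extracts/positions_$(date -d yesterday +%Y%m%d).csv  # GNU date syntax

today=$(wc -l < "$OUTPUT_FILE")
baseline=$(wc -l < "$BASELINE_FILE")
diff=$(( today > baseline ? today - baseline : baseline - today ))

# Fail the wrapper if the count moved more than 10% day on day.
if [ "$baseline" -gt 0 ] && [ $(( diff * 100 / baseline )) -gt 10 ]; then
  echo "WARN: extract row count moved from $baseline to $today" >&2
  exit 1
fi
```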
Interface timing issues that were not visible in testing. COB in TAFJ does not always run on exactly the same schedule as COB in TAFC, and the differences — even small ones — propagate into the timing of file exports, downstream triggers, and the windows that counterparty systems expect files to arrive within. In testing, these windows were rarely stressed. In production, a file that consistently arrives seven minutes later than the downstream system expects will eventually cause a problem. It usually takes two or three weeks of production data to make the pattern visible.
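The same logic applies to delivery windows: rather than waiting for a counterparty to notice, check at the cutoff whether the file actually arrived. A minimal cron-driven sketch, with the path, schedule, and recipient all invented:

```bash
#!/usr/bin/env bash
# Run from cron a few minutes after the agreed delivery window closes, e.g.:
#   35 6 * * 1-5 /opt/ops/check_export.sh
# The file path and mail recipient are placeholders.
EXPORT_FILE=/data/outbound/gl_export_$(date +%Y%m%d).txt

if [ ! -s "$EXPORT_FILE" ]; then
  echo "WARN: $EXPORT_FILE missing or empty at $(date +%H:%M)" \
    | mail -s "Export window missed" ops-team@example.com
fi
```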
Local code that worked in testing and does not work on production data. The test environment had clean, controlled data. Production data has edge cases — records created years ago under different rules, accounts with unusual configurations, transactions with combinations of fields that the test team did not think to include. A local routine that was thoroughly tested against representative data can still fail on production data that it was never shown. These failures often appear as occasional errors rather than systematic ones, which makes them harder to diagnose and easier to deprioritise until they are not.
The DBTools problem
This one deserves its own section because it affects every member of the operations and support team, it affects them repeatedly, and it is entirely predictable.
In TAFC, querying data in T24 meant running a SELECT at the command line. It was fast, familiar, and available to anyone who knew the application. In TAFJ, SELECT is gone. The replacement is DBTools — a separate console with its own authentication, its own syntax for certain operations, and a default result limit of 200 rows that it does not mention unless you already know to ask.
The practical consequences in the first month: support analysts who have spent years using SELECT reach for it instinctively and find it is not there. They switch to DBTools and run a query. It returns 200 rows. They conclude either that only 200 records exist or that the record they are looking for is not there. In some cases this leads to incorrect conclusions about the state of the data, which leads to incorrect actions, which leads to a different problem that is harder to explain than the original one.
The fix is straightforward once you know about it — DBTools has a row limit setting that can be adjusted. It is the kind of thing that should be in the runbook, clearly, near the front, before the forty-three pages about the compilation pipeline.
The knowledge problem
By the end of the first month, a pattern will have become clear: there are one or two people on the team who understand the TAFJ environment, and everyone else is asking them questions. This is a natural consequence of how migrations work — the people most involved in the project acquired knowledge that the rest of the team has not yet had time to build. It is also, if it is not addressed deliberately, a problem that compounds.
The one or two people who know the environment become the first call for every incident, every query, and every uncertainty. They answer the same questions repeatedly. They are on every escalation. They are, by the end of month one, tired in a way that is recognisable to anyone who has been in that position. And if either of them leaves, takes a holiday, or is simply unavailable at the moment something goes wrong, the team's ability to respond is significantly reduced.
The knowledge concentration problem does not resolve itself. It requires deliberate action: structured knowledge transfer, documented procedures, and the slightly uncomfortable discipline of making less experienced team members handle incidents with guidance rather than having the experts handle everything directly. The first thirty days are the best time to start this, because the incidents are happening frequently enough to provide learning opportunities and the environment is still novel enough that everyone expects to be learning.
What good looks like in the first thirty days
Operations teams that come through the first month well tend to have a few things in common. None of them are technically complex. Most of them require doing something rather than waiting for the situation to resolve.
They fix the monitoring before the first major incident, not after. The Grafana dashboards that the project team configured are a starting point. The operations team adds the alerts they actually need: JVM heap above a threshold, GC frequency above a threshold, COB stage duration outside expected range. This takes a day. It is the most valuable day of the month.
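Even before the Grafana work is done, a stopgap check is better than nothing. A sketch assuming the JVM exposes Prometheus-format metrics (for example via the JMX exporter) — the endpoint, metric names, threshold, and recipient are all assumptions to verify against your own setup:

```bash
#!/usr/bin/env bash
# Stopgap heap check until proper Grafana alerts exist. Assumes a
# Prometheus-format metrics endpoint; everything below is a placeholder.
METRICS_URL="http://tafj-app01:9404/metrics"   # hypothetical exporter endpoint
THRESHOLD=85                                   # heap usage, percent

usage=$(curl -sf "$METRICS_URL" | awk '
  $1 ~ /^jvm_memory_bytes_used/ && /area="heap"/ { used = $2 }
  $1 ~ /^jvm_memory_bytes_max/  && /area="heap"/ { max = $2 }
  END { if (max > 0) printf "%d", used / max * 100 }')

if [ -n "$usage" ] && [ "$usage" -ge "$THRESHOLD" ]; then
  echo "WARN: JVM heap at ${usage}% (threshold ${THRESHOLD}%)" \
    | mail -s "TAFJ heap alert" ops-team@example.com
fi
```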
They write their own runbooks, not the ones they were given. The handover documentation describes the environment as it was designed to work. The operations team's runbooks describe how to deal with it when it does not. These are different documents. The first is written before go-live. The second can only be written after it, and the first month provides ample material.
They run COB as a team exercise, not as a solo task. Every member of the operations team who might be responsible for a COB failure overnight should have run COB themselves, not just watched. The first month provides the opportunity. Waiting for the first 2am incident is not the same as being prepared for it.
They establish a clear escalation path while the project team is still reachable. Hypercare periods have a tendency to end faster than expected, and the project team's availability after that is a matter of goodwill rather than obligation. Before the formal period ends, the operations team should know who to call for what, have tested that those people respond, and have documented the answers to the questions that have already come up once so they do not need to come up again.
They treat every incident in the first month as a knowledge transfer opportunity. The incidents will happen. The question is whether they leave the team more capable than before or simply more tired. Writing up what happened, what the cause was, and what to look for next time takes twenty minutes per incident. Over a month of first-time encounters with a new environment, it produces a reference document that is worth considerably more than the handover documentation they were given.
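The write-up does not need to be elaborate. Something along the lines of the following, filled in while the incident is still fresh, is enough:

```
Incident:    <one-line summary>
Date/time:   <when it started / when it was resolved>
Symptom:     <what was observed, and by whom>
Cause:       <what actually happened, once known>
Diagnosis:   <where we looked, and what gave us the answer>
Fix:         <what resolved it>
Next time:   <what to check first if this recurs>
```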
What month two looks like if month one went well
By the end of thirty days, the environment is no longer new. The initial surprises have been encountered and, mostly, resolved. The monitoring reflects what actually needs monitoring. The runbooks describe what actually happens. The team knows where the log files are.
The TAFJ environment does not become simple in month two. It does become familiar. The difference between a team that treated the first month as an emergency to survive and one that treated it as an environment to understand is, by month two, quite visible. One of them is still reacting to the same kinds of incidents it faced in week one. The other is catching them before they become incidents.
None of what is described here requires the operations team to become TAFJ experts. It requires them to become competent operators of the specific TAFJ environment they have been given — which is a different and considerably more achievable thing. The first thirty days are when that competence is built, whether deliberately or not. Deliberately is better.
