Most people use IBM i System Monitoring and Notification software (System Monitoring) to alert on-call responders for items needing manual intervention such as program errors, disk drive spikes, and runaway jobs.
Beyond this common functionality, there are three configurations you can use in IBM i System Monitoring software, like my MessageFlex software, to automatically detect and deal with unexpected downtime.
Expected and unexpected downtime
Downtime occurs when your IBM i system isn’t available to users when it was promised for production processing.
The most common downtime situations happen during a restricted job stream such as end-of-day batch jobs, daily or weekly backups, or weekly file maintenance. Users are usually locked out of the system when these job streams are running, because your data files are in use and can’t be changed.
Restricted job streams usually run during a maintenance window, such as from 2:00 AM to 6:00 AM every weekday, which is your expected downtime. In many companies, production starts at 6:00 AM, when orders start shipping and the system needs to be available. If a restricted job stream runs too long and locks the system past its 6:00 AM start of production, it can cause unexpected downtime and trigger some harsh consequences, including:
- Staff on site who can’t access the system and do their jobs
- Orders that can’t be shipped to customers expecting just-in-time delivery
- Shippers unable to pick up a scheduled shipment on time
- Daily batch jobs crashing because files are still locked during the restricted job stream window
Here are three System Monitoring software configurations that can help you catch situations like these before they turn into unexpected downtime.
Configuration #1 – Detecting when key jobs are running long or not running at all
During restricted job stream lockouts, a System Monitoring package can determine when key jobs are running late. Let’s say your end-of-day job stream starts at 2:00 AM and locks the system after orders are closed.
If it’s the holiday season and you’ve been slammed with twice your normal order volume, you can set up monitors to determine whether your end-of-day job stream started on time and alert on-call responders when it’s running late. You might set up monitors to detect the following situations for end-of-day kickoff monitoring.
- If the end-of-day kickoff job hasn’t started by 2:15 AM, alert a responder. It’s running late.
- If the end-of-day kickoff job hasn’t finished by 2:30 AM, alert a responder. Something’s wrong.
You may want to monitor a few other key job run times (including jobs in the middle of the job stream) to determine whether the entire job stream is running off-schedule or whether only a particular job is running long.
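The kickoff checks above can be sketched in a few lines of Python. This is only an illustration of the logic, not how MessageFlex or any other package is configured. The `JobDeadline` class, the `check_job` function, and the job name `EODKICKOFF` are hypothetical, and the start and finish timestamps are assumed to come from whatever your monitoring tool reports. The sketch also assumes the whole job stream runs within one calendar day, so it compares times of day only.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional


@dataclass(frozen=True)
class JobDeadline:
    """Deadlines for one key job (all names here are hypothetical)."""
    job_name: str
    start_by: time    # alert if the job has not started by this time
    finish_by: time   # alert if the job has not finished by this time


def check_job(deadline: JobDeadline, now: datetime,
              started_at: Optional[datetime],
              finished_at: Optional[datetime]) -> Optional[str]:
    """Return an alert message for the responder, or None if all is well."""
    if finished_at is not None:
        return None  # job completed, nothing to report
    clock = now.time()
    # Check the finish deadline first because it is the more serious case.
    if clock >= deadline.finish_by:
        return (f"{deadline.job_name} has not finished by "
                f"{deadline.finish_by:%H:%M}. Something's wrong.")
    if started_at is None and clock >= deadline.start_by:
        return (f"{deadline.job_name} has not started by "
                f"{deadline.start_by:%H:%M}. It's running late.")
    return None


if __name__ == "__main__":
    kickoff = JobDeadline("EODKICKOFF", start_by=time(2, 15), finish_by=time(2, 30))
    # At 2:16 AM the job still has not started, so an alert is produced.
    print(check_job(kickoff, datetime(2024, 12, 2, 2, 16), None, None))
```

Defining one `JobDeadline` per key job, including jobs in the middle of the stream, makes it easier to tell whether the whole stream is off schedule or only one job is running long.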
Configuration #2 – Auto-answer record and file lock errors…to a point
To prevent timed job streams from slowing down and causing unexpected outages, you may want to use your System Monitoring software to automatically answer the following allocation error messages when they occur.
- RNX1218 in message file QSYS/QRNXMSG – Unable to allocate a record in file &7
- RPG1218 in message file QSYS/QRPGMSGE – &1 &2 is unable to allocate a record in &5
Most monitoring packages allow you to automatically answer messages with an ‘R’ (retry) without alerting a responder. If a single automated ‘R’ reply resolves your allocation issue, you’ve suffered no delay in completing the critical job stream.
For situations where the first retry doesn’t allow the job to allocate the file or record, I’d recommend having your system automatically send up to three retries for any particular allocation error before alerting a responder. A job may have to retry a record or file allocation more than once before the conflicting job releases its lock. Using three retries gives your system a chance to keep a restricted job stream running without manual intervention. If the error occurs a fourth time, the monitoring package can alert an on-call responder, because the locked object the job needs may not become available on its own.
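The "retry three times, then alert" rule can be expressed as a small counter. The sketch below is an assumption about how such a policy might be modeled, not a description of any product's internals. The class name `AllocationRetryPolicy`, the action strings, and the job name in the usage lines are hypothetical. Only the two message IDs and the limit of three retries come from the discussion above.

```python
from collections import defaultdict

# Allocation errors that are eligible for an automatic 'R' (retry) reply.
AUTO_RETRY_IDS = {"RNX1218", "RPG1218"}
MAX_AUTO_RETRIES = 3


class AllocationRetryPolicy:
    """Decide whether to auto-reply 'R' or alert a responder (hypothetical)."""

    def __init__(self, max_retries: int = MAX_AUTO_RETRIES) -> None:
        self.max_retries = max_retries
        # Count occurrences separately for each (job, message ID) pair.
        self._counts = defaultdict(int)

    def decide(self, job: str, message_id: str) -> str:
        if message_id not in AUTO_RETRY_IDS:
            return "UNHANDLED"  # leave the message to other monitoring rules
        key = (job, message_id)
        self._counts[key] += 1
        if self._counts[key] <= self.max_retries:
            return "REPLY_R"    # answer the inquiry message with a retry
        return "ALERT"          # fourth occurrence: notify the on-call responder


if __name__ == "__main__":
    policy = AllocationRetryPolicy()
    for attempt in range(1, 5):
        print(attempt, policy.decide("EODPOST", "RNX1218"))
```

Counting per job and message ID keeps one stubborn lock from using up the retries available to an unrelated job in the same stream.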
Configuration #3 – Check for critical post-job stream jobs and subsystems
Identify the jobs that must be up and running before your users can begin IBM i processing after the restricted job stream ends. Then set up monitors to ensure that those jobs are running immediately before the start of production.
If you shut down QINTER at the beginning of a restricted job stream and restart it at the end of the job stream, you could check for QINTER being up and running at 5:45 AM for a promised 6:00 AM start time. If the monitor doesn’t detect that necessary jobs or subsystems are up and running, have it notify your on-call responder. You can set up monitors to look for any job or subsystem that’s necessary for production processing.
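A pre-production readiness check like this comes down to comparing a list of required jobs and subsystems against what is actually active. The following sketch assumes the list of active names has already been gathered by your monitoring tool. The function names and the `QBATCH` entry are hypothetical, and `QINTER` is used only because it is the example above.

```python
from typing import Iterable, List, Optional

# Jobs and subsystems that must be active before the 6:00 AM start of
# production (illustrative list; adjust to your own environment).
REQUIRED_FOR_PRODUCTION = ["QINTER", "QBATCH"]


def missing_components(required: Iterable[str], active: Iterable[str]) -> List[str]:
    """Return the required names that are not in the active list."""
    active_names = {name.strip().upper() for name in active}
    return sorted(name for name in required
                  if name.strip().upper() not in active_names)


def readiness_alert(required: Iterable[str], active: Iterable[str]) -> Optional[str]:
    """Build the message for the on-call responder, or None if all is ready."""
    missing = missing_components(required, active)
    if not missing:
        return None
    return "Not running before start of production: " + ", ".join(missing)


if __name__ == "__main__":
    # Run at 5:45 AM with whatever active list your monitoring tool reports.
    print(readiness_alert(REQUIRED_FOR_PRODUCTION, ["QSYSWRK", "QBATCH"]))
```

Running the check 15 minutes before the promised start time leaves a responder a short window to restart whatever is missing.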
While you can’t always catch every situation that will result in unexpected downtime, these three System Monitoring configurations can help you detect unexpected downtime caused by restricted jobs and help keep your critical job streams running on track. Be sure to check out the related articles below for more information on using IBM i System Monitoring and Notification software like MessageFlex to keep your IBM i partition running smoothly.