I’ve written a lot about using IBM i System Monitoring and Notification software to alert on-call responders when problems occur on your IBM i partitions. I put together a comprehensive list of the possible issues organizations can monitor for in their IBM i environment, using a system monitoring package.
Here are 25 different system issues that most IBM i system monitoring packages can detect. Once detected, these packages can either send out alerts to on-call responders or automatically attempt to correct an error before sending out an alert.
- Hardware status – Monitoring can check for hardware issues with your IBM Power hardware, including disk, backup device, backplane, or other hardware failures.
- Disk drive utilization beyond thresholds – Monitoring can send out alerts when disk utilization passes a target threshold. Alerts can be sent out when utilization passes 80%, 85%, or 90% full or for any other disk usage milestone that you want to monitor for.
- Disk drive usage spikes – Many software packages can be configured to issue an alert whenever disk usage increases by a certain percentage over a specific time period. For example, a disk usage increase of 5% over an hour might indicate a malfunctioning program.
- RPG, CLP, PHP, and other program error messages – Responders can be alerted when there’s an application inquiry message that needs answering. This is the most common use for system monitoring packages.
- Job status changes – You may also want your monitoring package to detect jobs that are waiting on record locks or jobs in held status, alerting responders to changes in running job statuses that can affect job processing.
- CPU utilization – Alerts can be sent out when CPU utilization consistently runs above a specific percentage (for example, a CPU that runs above 90% all the time could be a sign that your machine is CPU-bound). These alerts tell you when the system is under a heavy workload and may need additional CPU.
- Individual batch or interactive jobs using excessive CPU – Monitoring can track the CPU utilization of individual jobs and alert you when a job’s CPU utilization exceeds a specified level. Batch CPU alerts can indicate looping or malfunctioning jobs.
- Disconnected jobs – System monitors can look for and automatically end disconnected telnet jobs or devices. A disconnected job may lock records, loop, or interfere with critical processing.
- Ping testing for companion servers on other IBM i partitions, Windows, or Linux servers – Software monitoring packages can ping a partner server to determine if that server is available. Unanswered pings provide an early warning that a co-processing server is down.
- Ethernet failures – Monitoring software can check that important Ethernet lines, controllers, devices, and IP addresses are up and running, flagging network connection or Ethernet port issues.
- Jobs running longer than expected – System monitoring can compare the current run time for any job with its expected run time to let you know when a job is running longer than expected. Unusually long run times may indicate looping or out-of-control jobs.
- Jobs running shorter than expected – Monitoring can check whether a job finished faster than its expected run time. Short-running jobs may indicate missing data, a malfunctioning job, or an error in a job’s processing parameters.
- Expected jobs not running – Monitoring for jobs that didn’t start executing at their expected times. This can indicate problems in a daily, weekly, or monthly job stream that can affect a critical batch window.
- Unexpected jobs are running on the system – A job is running outside of its normal processing window. This may indicate a problem with job scheduling, a job that was submitted by mistake, or jobs running out of sequence.
- Subsystem not available – Monitoring can detect when a required subsystem is not running. This may indicate that a required function, such as the QINTER subsystem or a third-party application, isn’t available.
- Too many jobs in job queue – Monitoring can generate alerts when jobs are piling up in specific job queues waiting to run. This could indicate issues with subsystem processing, meaning that jobs that need to process quickly (such as incoming orders) are not processing in a timely manner.
- Printer maintenance, including form changes, alignment, and paper out messages – Monitoring can send out alerts for printer attention items such as form changes, paper alignment, or out of paper. Different operations or departmental personnel can be alerted to handle these common printer maintenance functions.
- Excessive spooled files in an output queue – Monitoring can check output queues for an excessive number of spooled files waiting to print. Excessive printouts may indicate that a job is in a loop, producing multiple job logs or extra spooled files. They can also indicate printer problems where output is not being sent to a printer.
- User profiles varied off – A varied off profile can indicate a possible security issue, such as an external attacker or an internal user trying to sign on with someone else’s profile.
- IBM system user profiles signing on – Monitoring software can determine whether QSECOFR or another restricted IBM i system profile, such as the service user, has signed on.
- Specific devices varied off – Varied off devices can indicate that critical hardware, such as a backup drive, shipping printer, scanner, or controller is no longer connected to the system.
- No media in a media device – Most system monitoring packages can detect an empty media drive. This warns that a daily, weekly, or monthly backup may run into issues because media has not been loaded.
- Backup media not formatted – Monitoring software can determine whether a drive contains an initialized tape or other backup media.
- IBM i partition switched to UPS power – Monitoring can detect when your IBM i partition has switched to a backup uninterruptible power supply (UPS) due to an electrical outage. This informs responders that electrical power has been lost to your IBM Power hardware, possibly kicking off a disaster recovery or high availability switch scenario.
- IBM i partition UPS power batteries are low or depleted – A system monitoring package can alert you when there are problems with the UPS connected to your system, allowing the responders to inspect and fix failing UPS equipment.
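Several of the checks in the list above come down to comparing a measured value against a limit you choose. The Python sketch below illustrates that logic for disk thresholds, disk usage spikes, and job run times. It is only an illustration, not how any particular monitoring package is implemented or configured. The function names, the 50% run-time tolerance, and the reading of "5% over an hour" as five percentage points are all assumptions made for the example.

```python
from typing import Optional, Sequence


def highest_threshold_crossed(
    used_pct: float,
    thresholds: Sequence[float] = (80.0, 85.0, 90.0),
) -> Optional[float]:
    """Return the highest disk threshold that used_pct has reached, or None."""
    crossed = [t for t in thresholds if used_pct >= t]
    return max(crossed) if crossed else None


def is_usage_spike(
    earlier_pct: float,
    current_pct: float,
    max_increase: float = 5.0,
) -> bool:
    """True when usage grew by at least max_increase percentage points
    between two samples (assumed here to be taken one hour apart)."""
    return (current_pct - earlier_pct) >= max_increase


def classify_run_time(
    actual_minutes: float,
    expected_minutes: float,
    tolerance: float = 0.5,
) -> str:
    """Compare a job's run time with its expected run time.

    tolerance is a hypothetical allowed deviation, as a fraction of the
    expected run time (0.5 means 50% longer or shorter is still "ok").
    """
    if actual_minutes > expected_minutes * (1 + tolerance):
        return "long"
    if actual_minutes < expected_minutes * (1 - tolerance):
        return "short"
    return "ok"


if __name__ == "__main__":
    print(highest_threshold_crossed(87.0))  # disk at 87% full
    print(is_usage_spike(60.0, 65.5))       # usage an hour ago vs. now
    print(classify_run_time(100.0, 30.0))   # 100-minute run, 30 expected
```

In practice you would set these limits in your monitoring package's configuration rather than in code, but the same questions apply: which thresholds matter, how fast is too fast for growth, and how far a job may drift from its usual run time before someone is alerted.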
System monitoring packages are flexible and can creatively handle any number of IBM i system and application issues. For more information on how to specifically configure your own system monitoring package to meet your needs, see the helpful articles related to this post below or contact us at DRV Tech for more information.
Learn more about the messages and alerts available to you through MessageFlex by DRV Tech.