This case study details the value delivered under F3 Technology Partners’ Managed Services to a member-owned and member-operated, full-service financial institution in North America with over $600 million in assets. The institution's primary business application is DNA from Fiserv, a core banking platform that runs on an Oracle database back end. F3 Technology Partners was contracted to provide remote monitoring and support services for the Oracle database backing DNA.
In early October 2014, monitoring went live at the customer and the F3 Level 1 support desk was flooded with alerts from the customer’s systems. Before monitoring was implemented, the client had no visibility into database system operation and was completely unaware of the underlying issues flagged in the alerts. The F3 Level 1 team quickly escalated the alerts to the F3 DBA team, who in turn identified several critical problems:
- Database name resolution not working on local standby
- Local standby database not in sync with production
- Remote (DR) standby database not in sync with production
- Disk on primary production database server consistently 99-100% busy
Over the following days and weeks, F3’s DBAs worked closely with the customer’s network administration team to resolve each identified issue or recommend a fix.
Database name resolution was fixed the same day with simple updates to the TNSNAMES.ORA file on the affected host.
Having a functional standby database was identified as the next priority, and within a few days the local standby was rebuilt during weekend off-peak hours. F3’s DBAs remained in contact with the client throughout the process, providing status updates and confirmation once the rebuild had completed successfully.
The following week, attention turned to the remote standby. Initial recovery attempts failed because the slow WAN link between the production and DR sites could not move data fast enough to bring the standby up to date with production. The F3 team continued to coordinate with the client as they physically moved the server from the DR data center, a three-hour drive each way, to the production site. Once there, the entire host was rebuilt, and the database systems were reinstalled, reconfigured, and resynced across the LAN. With everything back in good working order, the server was returned to the DR site and sync was resumed. With only three hours’ worth of activity to catch up on, the remote standby was quickly back in sync with production.
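The reasoning behind moving the server can be illustrated with some rough arithmetic. The sketch below is not taken from the engagement: the function name `catch_up_hours` and every figure in it are hypothetical, chosen only to show why a large backlog over a slow link is impractical while a three-hour backlog over the same link is not. It assumes a constant redo generation rate and a constant usable link rate.

```python
def catch_up_hours(backlog_gb, redo_gb_per_hour, link_gb_per_hour):
    """Estimate hours for a standby to clear its backlog.

    The standby only gains ground at the rate by which the link
    outpaces new redo generated on production. Returns None when
    the link cannot keep up at all.
    """
    net_rate = link_gb_per_hour - redo_gb_per_hour
    if net_rate <= 0:
        return None  # backlog grows or never shrinks
    return backlog_gb / net_rate


# Hypothetical figures, for illustration only.
REDO_RATE = 1.0   # GB of redo generated per hour on production
WAN_RATE = 4.0    # GB per hour the WAN link can carry

# Full resync over the WAN: a large backlog takes days.
print(catch_up_hours(600, REDO_RATE, WAN_RATE))

# After a LAN resync and a three-hour drive, only about three
# hours of redo is outstanding, so the same WAN link is enough.
print(catch_up_hours(3 * REDO_RATE, REDO_RATE, WAN_RATE))
```

Under these assumed numbers the first estimate runs to more than a week and the second to about an hour, which mirrors the outcome described above.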
The situation with both standby servers was created in part by the client’s standby recovery process, which entailed manually shipping archive logs from production to the standby, then issuing the RECOVER DATABASE command on the standby site. This method is not only cumbersome; if a failover is ever needed, it can also mean lost transactions and extended downtime while the standby is brought in sync. The process can be completely automated and managed using Oracle Data Guard, a component of Enterprise Edition that the client had already paid for but was not using. As a proactive step to prevent similar situations in the future, F3 enabled and configured Data Guard, which keeps both standby servers within seconds to minutes of production.
With the standby servers back up and running, F3 began analyzing the disk busy alerts. End users at the customer had not been reporting slow application performance, but that level of disk activity was concerning and left little headroom in the event of increased load. By closely monitoring the database during operation, F3 identified a single SQL statement that accounted for the majority (~60%) of the load. This information was provided to the client to escalate to Fiserv, the application vendor. F3 also identified the placement of all Oracle data files on a single disk spindle as a major bottleneck, and provided the client admin team with a number of recommendations to resolve it.
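To make the idea of one statement dominating the load concrete, the following sketch ranks statements by their share of total database time. It is an illustration only and not the tooling F3 used: the function `load_share`, the statement identifiers, and the sample timings are all hypothetical, and the input is assumed to be a list of (statement id, seconds of database time) observations collected by some monitoring process.

```python
from collections import Counter


def load_share(samples):
    """Rank SQL statements by their fraction of total database time.

    `samples` is an iterable of (sql_id, seconds) pairs. Returns a
    list of (sql_id, fraction) tuples, heaviest statement first.
    """
    totals = Counter()
    for sql_id, seconds in samples:
        totals[sql_id] += seconds
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # nothing was observed
    return [(sql_id, t / grand_total) for sql_id, t in totals.most_common()]


# Hypothetical observations, for illustration only.
observations = [
    ("stmt_a", 360),
    ("stmt_b", 90),
    ("stmt_a", 240),
    ("stmt_c", 150),
    ("stmt_b", 160),
]

for sql_id, fraction in load_share(observations):
    print(f"{sql_id}: {fraction:.0%}")
```

With these made-up figures, `stmt_a` accounts for 60% of the observed time, the kind of outlier that is worth escalating to the application vendor.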
Since those first few weeks, database systems at the customer have been running much more smoothly. F3 continues to work closely with the client to identify and resolve potential issues as they arise.