Restore Banner Database

Banner database restore procedures

The mission statement of Oakland University Technology Services emphasizes that we are dedicated to maintaining an “always on” computing environment that is “trustworthy”. System failures can be unpredictable, and having good backup and restore procedures are an essential part of ensuring that our systems, when they do experience failures, can be restored to service quickly with a minimum of data loss.

Oakland University currently has two different processes in place to support disaster recovery for the Banner databases:

  • Physical standby database - This is a copy of the database that runs on another database server. For Banner PROD, the primary database server is ea-banproddb-p01.sys and the standby server is ea-banproddb-p02.sys. On a regular basis (approximately every 30 minutes, or more often in the case of heavy update activity), a redo log switch is performed on the primary database, and this causes the log file to be propagated to the standby database server and applied there. Scripts run on the primary database server every hour to warn the Database Administrator when the standby databases get too far out of sync with the primary.
  • Oracle RMAN backups - On both the primary and standby database servers, the script /home/oracle/oakdba/rman_backups/rman_backup.shl is scheduled to run via the oracle user's crontab. The backups currently run on the following schedule:
    • TEST
      • Sunday at 1:00 am - Level 0 backup (full)
      • Monday-Saturday at 1:00 am - Level 1 backup (incremental)
    • AMTST8
      • Sunday at 3:00 am - - Level 0 backup (full)
      • Monday-Saturday at 3:00 am - Level 1 backup (incremental)
    • EMOMS
      • Sunday at 3:30 am - - Level 0 backup (full)
      • Monday-Saturday at 3:30 am - Level 1 backup (incremental)
    • DWTEST
      • Sunday at 4:00 am - - Level 0 backup (full)
      • Monday-Saturday at 4:00 am - Level 1 backup (incremental)
    • PROD
      • non-incrementals currently running daily starting at 2:00 am - to be replaced by an incremental backup strategy
    • AMPROD
      • non-incrementals currently running daily starting at 2:00 am - to be replaced by an incremental backup strategy
    • DWPROD
      • non-incrementals currently running daily starting at 2:00 am - to be replaced by an incremental backup strategy
    • ODS
      • more information needed

These backup files are kept on the server being backed up, located in a filesystem different from those where the datafiles are stored. A Veeam process managed by TSS copies the backup files to a backup server.

For incremental backups, the recovery window for RMAN backups is 8 days. For non-incremental backups, the recovery window is 2 days. To view all RMAN policies for a database you may do the following:

  1. As the oracle user, with the ORACLE_SID set to the desired database, enter "rman target /".
    1. At the RMAN prompt, enter "show all;"

Please note: RMAN policies are specific to each database.

In case of catastrophic failure of the primary Banner database server, it should be possible to use processes described in this document to restore Banner database services within a reasonable time frame.

Failures not requiring failover or media recovery

Some system failures, though inconvenient, are not serious enough to initiate a lengthy recovery procedure. Examples of such failures could be:

  • Momentary power outages
  • CPU failures
  • Memory failures
  • Momentary network failures
  • Operating system crashes

In such cases, as long as there is no damage to the disk storage systems and the system can be restarted immediately, there may be no need to initiate a database failover or to restore files. When the Oracle databases are restarted, they will perform instance recovery from their online redo logs, so that all committed transactions from before the failure are written to disk, and operations may continue from that point.

Physical standby databases

Oracle Data Guard is used to maintain physical standby databases for all production and test databases other than ODS. These databases are not available for regular use while the primary databases are active. However, if there is an interruption in service on the primary server (such as a lengthy power outage or network failure in the Dodge Hall data center only), the standby databases can assume the role of the primary databases within a few minutes with no loss of committed data, as long as any physical data corruption which may have occurred on the primary database server was not transmitted to the standby.

Archived redo logs are shipped across the network from the primary to the standby databases and are applied regularly. These logs are applied approximately every 30 minutes, or more often, depending on the amount of update activity occurring on the primary. The Database Administrator monitors the archived redo log application process regularly to verify that it is functioning properly.

Transfer and application of archived redo logs may be disrupted if there is a shortage of physical disk space on the standby server, or if the space reserved for recovery-related files (configured in the standby server using db_recovery_file_dest_size) becomes exhausted. This condition should be addressed by the Database Administrator as soon as possible. There are several methods that can be used to correct this condition:

  • Use RMAN to delete unneeded backup sets and archived redo logs - this is NOT recommended, since doing so may endanger the ability to recover the database in the event of an emergency.
  • Use RMAN to reduce the recovery window - this is also not recommended, since the incremental backup strategy requires up to 7 backups (one full and zero or more incrementals).
  • Increase the value of the database initialization parameter db_recovery_file_dest_size - this will help as long as there is sufficient free disk space in the location specified by the db_recovery_file_dest parameter.
  • Add more physical storage to the database server to store backup sets and archived redo logs - this will help as long as the db_recovery_file_dest_size is large enough to use the allocated storage.

If the primary database is available and undamaged when the standby database is to be activated, a “switchover” may be performed; otherwise, a “failover” must be performed.

Switchover to a standby server

If the primary databases are available and undamaged, but must be removed from service for any reason (for example: an emergency shutdown in case of air conditioning system failure), it may be possible to perform a switchover to the standby databases, if there is enough time to do so before a primary system failure.

Switchover to the standby database requires the following steps (to be repeated for every primary database on the server):

  1. Confirm that the standby database is in sync with the primary by issuing the following O/S commands as oracle on the primary database server:
    1. /home/oracle/oakdba/hourly/check_standby.shl PROD
    2. /home/oracle/oakdba/hourly/check_standby.shl AMPROD
    3. /home/oracle/oakdba/hourly/check_standby.shl DWPROD
  2. If the standby databases are in sync with their primaries, you may continue with the following steps; otherwise, the Database Administrator must take action to get them in sync.
  3. Have the network team prepare a DNS change to move the CName of the primary database server to the standby database server.
  4. Perform the following steps for each primary database on the affected server:
    1. On the primary database, issue “SELECT SWITCHOVER_STATUS FROM V$DATABASE;”
      1. If the value returned is “TO STANDBY” or “SESSIONS ACTIVE”, you may proceed with switchover. If a different value is returned, there is a problem with log transport from the primary to the standby, and a switchover would result in data loss.
    2. On the primary database, issue “ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;”
    3. On the primary database, issue “SHUTDOWN ABORT;”
    4. If you intend for the primary database to immediately assume the standby role, then on the primary database, issue “STARTUP MOUNT;”
    5. On the standby database, issue “SELECT SWITCHOVER_STATUS FROM V$DATABASE;”
      1. If the value returned is “TO PRIMARY” or “SESSIONS ACTIVE”, you may proceed with switchover;
      2. otherwise, ensure that archived redo logs from the primary database are being transported and applied.
      3. Continue to issue the above query until the value returned is “TO PRIMARY” or “SESSIONS ACTIVE”.
    6. On the standby database, issue “ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;”
    7. On the new primary (formerly standby), issue “ALTER DATABASE OPEN;”
    8. On the new standby (formerly primary), issue “ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;”
    9. On the new standby, edit /etc/oratab, changing the flag for the database from “Y” to “S”.
    10. On the new primary, edit /etc/oratab, changing the flag for the database from “S” to “Y”.
  5. Have the network team immediately deploy the DNS change they prepared earlier.

Failover to a standby server

If the primary database server has already failed and/or is damaged, it is not possible to perform a switchover; however, if the standby server is available and undamaged, it is possible to perform a failover.

Failover to the standby database requires the following steps (to be repeated for every primary database on the server):

  1. Ensure that the primary databases are down.
  2. If possible, issue the O/S command “sudo touch /etc/nodatabase” on the database server. If you cannot do this immediately, then reboot the server into single user mode when you are able, and issue the above touch command before performing a regular reboot on that server.
  3. Confirm that the standby databases are in sync with their primary databases:
    1. On the standby databases, issue “SELECT UNIQUE THREAD# AS THREAD, MAX(SEQUENCE#) OVER (PARTITION BY thread#) AS LAST from V$ARCHIVED_LOG;” to determine the highest log sequence number.
    2. On the standby databases, issue “SELECT THREAD#, LOW_SEQUENCE#, HIGH_SEQUENCE# FROM V$ARCHIVE_GAP;” to determine if there are any archived redo log gaps.
    3. If any gaps are displayed, copy or restore the related archived redo logs from the low to the high sequence numbers inclusive. If you are unable to do this, a failover operation will result in data loss.
    4. Issue “ALTER DATABASE REGISTER PHYSICAL LOGFILE ‘filespec’;” for each archived redo log file you restored.
    5. Repeat from step b above until there are no gaps.
  4. Have the network team prepare a DNS change to move the CName of the primary database server to the standby database server.
  5. On the standby database, issue “ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;”
  6. On the standby database, issue “ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;”.
  7. If there are no errors from step 3, you may continue as follows:
    1. On the standby database, issue “SELECT SWITCHOVER_STATUS FROM V$DATABASE;”. If the result of the above query is “TO PRIMARY” or “SESSIONS ACTIVE”, you may continue; otherwise, repeat this query until you see one of those two responses. If you do not see an acceptable response within a few minutes, you should open a priority 1 customer support request with Ellucian if you have not already done so.
    2. On the standby database, once you have an acceptable response from the above query, issue “ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;”
  8. If you receive an error message about unresolved redo gaps from step 3, and you have already performed step 1, you can continue; however, there will be data loss. If you choose to continue, proceed as follows:
    1. On the standby database, issue “ALTER DATABASE ACTIVATE PHYSICAL STANDBY DATABASE;”
    2. Based on the unresolved redo gaps, make a note of the impact of the data loss; the Banner Operating Committee will need to be notified so that lost data might be reentered.
  9. On the new primary (previously standby) database, issue “ALTER DATABASE OPEN;”
  10. Have the network team immediately deploy the DNS change they prepared earlier.
  11. When the old primary server becomes available for service again, the Database Administrator will need to recreate standby databases there, and a switchover using the above procedure will need to be scheduled.

Oracle RMAN backups

These backups are performed daily on the database servers on the schedule described above. The RMAN utility creates backup sets from the database files and archived redo log files; these files are compressed and maintained on the database server on disk storage dedicated for that purpose. RMAN also manages purges of old backup sets and archived redo logs, based on the configuration of the database’s retention policy.

RMAN can detect and repair data corruption in the case of errors; it can also be used to manually check for such corruption using the “validate database;” command within RMAN.

In the case of a failure that requires media backup, RMAN can be used to restore lost or damaged database files individually, or can restore the entire database, to a point in time when the database was known to be stable. Any work committed after the selected point in time would be lost.

We would use these backups as the second line of defense, after it is determined that the standby database is not available for failover/switchover (due to incomplete application of redo logs, propagation of data corruption from the primary database, or other damage to the standby server or database).

The specifics of the recovery procedure depend on the type of damage that occurred (loss of entire database, loss of one or more individual data files, loss of a control file, loss of a redo log, block corruption, &c). RMAN can be used to recover the entire database, or individual files. RMAN keeps track of database failures that it detects, and can provide instructions for automatic or manual repair.

If possible, for specific instructions for recovering or repairing the database, browse to https://docs.oracle.com/en/database/database.html, select the version of the database you are using, click the “High Availability” link, and search the Backup and Recovery Users Guide for the scenario that best matches the given situation.

Some documentation for a few selected scenarios is shown below.

Full manual database recovery

If all or most database files are lost or damaged, but you still have at least one valid copy of the database’s control file, you can perform a complete recovery of the whole database as follows:

  1. On the database server, become the oracle user. When prompted for an ORACLE_SID, enter the database you wish to recover.
  2. The database to be recovered must be shut down. As the root user, issue “systemctl stop oracledb” to shut down all databases on the server.
  3. As the oracle user, issue the “rman target /” command to enter the RMAN subsystem.
    1. Issue the command “startup mount;” to start the database in the appropriate mode.
    2. Issue the command “restore database;” to restore the database from backup files.
    3. Issue the command “recover database;” to roll forward through archived redo logs.
    4. If no errors were received, issue “alter database open;” to open the database.
  4. Once all databases needing recovery have been successfully recovered, as the root user, issue “systemctl restart oracledb” to stop all the databases, and then bring them up again with Banner services active.

Database failure

If there has been data corruption, the procedure for recovering from database failure using RMAN is as follows:

  1. On the database server, become the oracle user. When prompted for an ORACLE_SID, enter the database you wish to recover.
  2. Issue the command “rman target /” to enter the RMAN subsystem.
    1. Issue the RMAN command “list failure;” to display database failures for the database. Each failure will have a unique ID number, and a status “OPEN” or “CLOSED”, and a priority “CRITICAL”, “HIGH”, or “LOW”. Some failures may be grouped together if they are similar and can be dealt with together. If there are any OPEN failures, you will need to review them. CRITICAL and HIGH priority failures must be dealt with immediately.
    2. Issue the RMAN command “advise failure;” to view instructions on how to proceed. RMAN will list manual and/or automated recovery steps. If possible, RMAN will build a script to execute some or all of the needed repairs.
    3. If any manual steps are listed, try those first; then repeat from step 3 above.
    4. If no manual steps are listed, or they cannot be performed, execute “repair failure preview;” to view the automated steps that will be run; then execute “repair failure;” to run them. Respond to any prompts you are given.
    5. After performing repairs, issue “list failure;” to confirm that all failures are now CLOSED.
    6. If you performed manual repairs, RMAN may not record the repair. You may enter “change failure <ID> closed;” if you believe that the failure has been repaired; however, the failure will be recreated with a new failure ID if a future failure validation still finds it.

DataAdminHowTo

DB_Administration