Backups: Part 5 – Process Dependencies – Databases Example

It is a frequent occurrence for a database backup to be de-coupled from the subsequent tape backup, with the risk that an over-run or failure of the database backup schedule goes undetected by the tape backup that follows it.

I recommend that the database and tape backups always be coupled so that they can be considered one composite job, where the backup is dependent upon the application being in a suitable state for a backup.

If scheduled from the UNIX cron scheduler or from the application's own scheduler, the job should call a wrapper process which invokes the backup once the system is known to be in a suitable state, as defined within the backup script.
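For instance, a single crontab entry can drive the whole composite job – a minimal sketch, assuming a hypothetical wrapper script /usr/local/bin/composite_backup.sh which performs the quiesce, the backup, and the release in order:-

# m h dom mon dow  command
0 1 * * * /usr/local/bin/composite_backup.sh >/var/log/composite_backup.log 2>&1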

If scheduled from the backup software, the database dump, quiesce, or shutdown should be scheduled as a backup pre/post command.

Verify that the backup destination is available – in the case of disk, make sure it is mounted and writable; in the case of tape, make sure it is loaded and writable.
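For a disk destination, a minimal sketch of such a check, assuming a hypothetical mount point /path/to/backup and the Linux mountpoint(1) utility:-

DST_MNT="/path/to/backup"

# Verify the backup filesystem is actually mounted, not just an empty mount point.
mountpoint -q "$DST_MNT" || { echo "backup target not mounted" >&2; exit 1; }

# Verify it is writable by creating and removing a scratch file.
touch "$DST_MNT/.write_test" 2>/dev/null || { echo "backup target not writable" >&2; exit 1; }
rm -f "$DST_MNT/.write_test"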

The backup-pre command should not be run if the backup media cannot be verified as available.

The backup-pre command should bring the system to a safe state.

The backup-post command should bring the system back to an open state.

The backup job should not be initiated if the backup-pre command fails.

The backup-post command should be run whether or not the backup-pre command or the backup command itself fails.

The backup-post command should return success if the application is already running satisfactorily.

Any media movement should be checked and performed prior to entering the backup-pre process.

In the event of a media movement error, the backup-pre process should not be run, nor should the backup be run.
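For a tape destination, a minimal sketch of such a gate, assuming a hypothetical no-rewind device /dev/nst0 and the Linux mt(1) utility (the exact status keywords vary between implementations):-

# Only proceed to backup-pre if the drive reports a loaded, online tape.
if ! mt -f /dev/nst0 status 2>/dev/null | grep -q ONLINE; then
  echo "tape not loaded or drive not ready - skipping backup-pre and backup" >&2
  exit 1
fi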

The pre and post commands should be attempted multiple times to mask transient errors. Something like the following code fragment is sufficient for providing a 3-strike attempt:-

#!/bin/sh
SRC="/some/list/of/app/dirs"
DST="/path/to/backup/device"
RETVAL=1

# Run a command up to three times, stopping at the first success.
try_thrice() {
  "$@" || "$@" || "$@"
}

backup_pre() {
  /path/to/some/app/command_to_freeze
}

do_backup() {
  # SRC is deliberately unquoted so that a space-separated list of
  # directories expands into separate tar arguments.
  tar cf - $SRC | bzip2 -9c - > "$DST"
}

backup_post() {
  /path/to/some/app/command_to_thaw
}

# Do not run the backup if the freeze fails, but still attempt a thaw
# so the application is not left in a frozen state.
if ! try_thrice backup_pre; then
  try_thrice backup_post
  exit 1
fi

try_thrice do_backup
RETVAL=$?

# Always attempt to thaw, even if the backup itself failed.
try_thrice backup_post

# Exit non-zero if either the backup or the thaw failed.
exit `expr $? + $RETVAL`

So…in a nutshell:-

To improve data security and backup capacity management, a database backup should be linked to the tape backup such that the newly created database backup is copied to tape at the earliest opportunity, and the tape backup should be configured not to run if the database backup-pre command fails.

Backups: Part 4 – Dropped Files

In this little brain-splat I bring to you another backup woe – “Dropped Files”. These are an area of backup which is frequently overlooked: many are concerned only with whether or not the backup job finished and succeeded. Many times I have seen backup reports where thousands of files are dropped daily as a matter of course due to a lack of content-exclusion base-lining.

All backups should be iteratively bedded-in on initial configuration until they run with 0 dropped files at least 99% of the time.

The danger of dropped files is that if you accept them as the norm, you will miss the critical files when they are dropped. Only by striving to maintain 0 dropped files through appropriate exclusions can you meet an absolute criterion of a good backup and see the real backup exceptions when they happen.

Dropped files are a critical part of the assessment of whether a backup is good, so it is a mandatory process to eliminate any hot files and directories which are not required for a DR, such as temp files and spool files. Eliminating these sources also reduces the backup payload, which shortens not only your backup times but also your RTO, as there is less data to restore.
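A minimal sketch of such exclusions, extending the earlier tar-based fragment with hypothetical temp and spool patterns (GNU tar syntax):-

do_backup() {
  # Exclude hot files and directories which are not required for a DR.
  tar cf - \
      --exclude='*/tmp/*' \
      --exclude='*/spool/*' \
      --exclude='*.tmp' \
      $SRC | bzip2 -9c - > "$DST"
}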

RPO and RTO

RPO is the “recovery point objective”, the acceptable amount of data loss from the point of failure.

RTO is the “recovery time objective”, the acceptable amount of time available to restore data within the RPO from the point of failure.

The RTO is often dictated by an SLA or KPI or the like, and often this is unrealistic in the event of a real disaster scenario.

The RPO is often dictated by the backup policy; it should instead be dictated by the SLA as a data-loss acceptance level from the business. If a system is backed up once per day then the RPO is automatically 24 hours.

Backup Basics (Don’t throw the baby out with the bath-water!)

In part 2 of my backup basics snippets, I introduce you to the concept of “not throwing the baby out with the bath-water”, and what this means in backup terms is that you do not delete your last good backups before you have secured your next backup.

It seems so tempting for many to just wrench the retention period down to 0 and obliterate backups which, where the previous backup has failed, may be the only valid recovery plan.

The mitigation of this risk has a capacity consequence, which should be calculated at the time of system design. The mistake illustrated here is most frequently encountered on database systems, where there is often insufficient space to back up the database once it breaches 50% of its data LUN.

If the data set is expected to be up to s GB and c cycles are required, then the backup volume or device needs a capacity of at least (c * s) + s GB in order to hold sufficient copies of a backup; preferably on cheaper, albeit slower, storage.
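As a worked sketch with hypothetical numbers – a 500 GB data set retained for 7 cycles:-

s=500   # data set size in GB (hypothetical)
c=7     # number of retained backup cycles (hypothetical)
# (c * s) + s: one slot per retained cycle plus one for the backup in flight.
echo "required backup capacity: $(( c * s + s )) GB"   # prints 4000 GB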

Do not skimp on backup storage if you want your DBAs to love you; they hate application-based backup agents and will give their left sock for a flat-file on-disk backup any day!

Most importantly – it will also improve your RTO, and, if archive logs are available, it will improve your RPO too.

The thing is to think ahead and always provision at least (c * s) + s storage for the backup volume.

Keep this golden rule and your backup solution should remain fit for purpose given the finite storage resource available to the production data itself.