The lines between replication, backup and archive are often blurred, and the terms are sometimes wrongly used interchangeably. The difference is not just semantics. While each method involves copying data, each serves a distinctly different function. Think of them as short-term, medium-term and long-term protection, respectively. Allow us to explain.
Backup vs Replication
Backup is the process of making a secondary copy of data that can be restored if the primary copy is lost or becomes unusable. Backups usually comprise a point-in-time copy of primary data taken on a repeating cycle – daily, weekly or monthly.
Backups can also be used to roll back a set of data or files to a previous point in time as part of a maintenance or upgrade process, or as a virtual machine cloning tool.
Backup may be required in the following scenarios:
- Logical corruption – Data can become corrupted through application software bugs, storage software bugs or hardware failure, such as a server crash.
- User error – An end user may delete a file or directory, a set of emails or even records from an application and subsequently need the data again.
- Hardware failure – Failure scenarios can include hard disk drive (HDD) or flash drive failure (multiple failures can cause data loss even when RAID is used), server failure or storage array failure.
- Hardware loss – Possibly the worst scenario is an event such as fire that renders hardware inoperable and permanently unrecoverable.
Remote data replication is sometimes assumed to be equivalent to backup, but this is not the case.
Replication solutions can be either synchronous or asynchronous, meaning transfer of data to a remote copy is achieved either immediately or with a short time delay. Both methods create a secondary copy of data identical to the primary copy, with synchronous solutions achieving this in real time.
This means that any data corruption or user file deletion is immediately (or very quickly) replicated to the secondary copy, which makes replication ineffective as a backup method.
Another point to remember with replication is that only one copy of the data is kept at the secondary location. This means that the replicated copy doesn’t include historical versions of data from preceding days, weeks and months, unlike a backup.
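The difference can be illustrated with a small, purely in-memory sketch. It is a toy model under stated assumptions, not how any particular product works: the class names `Replica` and `BackupStore` and the file names are hypothetical, and a Python dictionary stands in for the primary data set.

```python
import copy


class Replica:
    """Toy mirror: holds exactly one copy, always identical to the primary."""

    def __init__(self):
        self.data = {}

    def sync(self, primary):
        # Replication copies the current state, deletions and corruption included.
        self.data = copy.deepcopy(primary)


class BackupStore:
    """Toy backup: keeps a history of point-in-time copies."""

    def __init__(self):
        self.snapshots = []  # list of (label, copy of the data at that time)

    def take_backup(self, label, primary):
        self.snapshots.append((label, copy.deepcopy(primary)))

    def restore(self, label):
        for name, snapshot in self.snapshots:
            if name == label:
                return copy.deepcopy(snapshot)
        raise KeyError(label)


primary = {"report.doc": "v1", "budget.xls": "v1"}
replica = Replica()
backups = BackupStore()

replica.sync(primary)
backups.take_backup("monday", primary)

# A user deletes a file by mistake; the next sync propagates the deletion.
del primary["report.doc"]
replica.sync(primary)
backups.take_backup("tuesday", primary)

# The replica no longer has the file, but Monday's backup still does.
recovered = backups.restore("monday")
print("report.doc" in replica.data)
print(recovered["report.doc"])
```

The replica only ever reflects the latest state of the primary, while the backup store can return the data as it was before the mistake.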
Archiving vs Backup
The distinction between backup and archive is often blurred but should be very clear.
Backups are made at least daily, leave the original data in place, and have the aim of protecting data against technology failure or human error over relatively short periods, such as weeks or months.
Archiving, on the other hand, is the retention of data for lengthy periods, usually years and sometimes decades, and it involves moving the data off its primary location.
Greg Schulz, senior advisory analyst at StorageIO, explains: “Backup is for restoring a file, object, database, volume or system based on some recovery time objective and recovery point objective, whereas the archive is a picture of the data and its state at a point in time.”
Schulz highlights key characteristics of archiving systems. These include: “Indexing and metadata management for search, replication, cloning, secure shred, Worm (write-once read-many), along with compliance or regulatory items.”
In addition, archiving includes movement of data off production storage systems onto the archive medium, driven by retention policies. “Data mover tools may be tightly or loosely integrated with the destination or target devices and in some cases even have overlapping features,” says Schulz.
A third component, which attracts less attention than the archive target and the data mover but is the most important, is how the data mover tools integrate with different applications. Those applications need to be configured with rules or policies to archive the data themselves or to present it to the data mover.
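As a rough illustration of policy-driven data movement, the sketch below moves files that have not been modified within a retention threshold from a primary directory to an archive directory and records a small index entry for each one. Everything here is an assumption made for the example: the function name `archive_old_files`, the age-based policy, the directory layout and the fields in the index are hypothetical, and a real data mover would also handle name collisions, verification and application integration.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def archive_old_files(primary_dir, archive_dir, max_age_days, now=None):
    """Move files not modified within max_age_days out of primary_dir.

    Returns a list of index entries describing what was moved.
    Name collisions in archive_dir are not handled in this sketch.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    index = []
    for path in sorted(Path(primary_dir).iterdir()):
        if not path.is_file():
            continue
        stat = path.stat()
        if stat.st_mtime < cutoff:
            target = archive_dir / path.name
            # Unlike a backup, archiving removes the file from primary storage.
            shutil.move(str(path), str(target))
            index.append({
                "name": path.name,
                "size": stat.st_size,
                "modified": stat.st_mtime,
                "archived_to": str(target),
            })
    return index


# Demo in a temporary directory, so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    primary = Path(tmp) / "primary"
    archive = Path(tmp) / "archive"
    primary.mkdir()
    (primary / "old_contract.txt").write_text("signed in the distant past")
    (primary / "current_notes.txt").write_text("still in use")

    # Pretend the contract was last modified two years ago.
    two_years_ago = time.time() - 730 * SECONDS_PER_DAY
    os.utime(primary / "old_contract.txt", (two_years_ago, two_years_ago))

    index = archive_old_files(primary, archive, max_age_days=365)
    remaining = sorted(p.name for p in primary.iterdir())
    archived = sorted(p.name for p in archive.iterdir())

print(remaining, archived)
```

The returned index is the seed of the metadata an archiving system relies on for later search and retrieval.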
The storage medium can be another point of distinction. Media used for backup need to ingest vast quantities of data quickly during a limited time window. As a result, disk has increasingly been used in place of tape, both for its higher performance and for the faster access it gives to recently backed-up data.
Archives, on the other hand, have increasingly become tape-based. Tape is cheap and robust over long periods of time, and its fairly slow recovery speed is seldom a problem because retrievals are infrequent. This also allows time for the long process of indexing and creating metadata.
For an organization that uses backup software to archive data to tape and then store those tapes off-site, retrieving data involves a number of steps.
Tapes that contain the required data need to be identified, retrieved from off-site storage, then mounted, read and possibly deleted. All these operations can be problematic, especially when reading tapes that may be several years old. Are they still readable or has the medium degraded? Are hardware and software still compatible? And how long does it take to find the data?
With those obstacles surmounted, a rich set of metadata is required to find the relevant information, especially if it was created over a considerable period of time, as a large number of files in multiple formats will need to be examined.
Archiving systems can help resolve many of these issues. Rich metadata enables identification of the correct tapes and ensures the required data is retrieved quickly, while tape libraries ensure tapes are regularly refreshed to avoid bit rot.
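To show how metadata can narrow a search to the right tapes, here is a minimal sketch of a catalog lookup. The `CatalogEntry` fields, the `find_tapes` function and the tape IDs are all hypothetical and chosen only for illustration; real archiving systems keep far richer metadata in their own formats.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CatalogEntry:
    filename: str
    owner: str
    created: date
    file_format: str
    tape_id: str


def find_tapes(catalog, *, owner=None, file_format=None,
               created_from=None, created_to=None):
    """Return the sorted tape IDs holding entries that match every given filter."""
    tapes = set()
    for entry in catalog:
        if owner is not None and entry.owner != owner:
            continue
        if file_format is not None and entry.file_format != file_format:
            continue
        if created_from is not None and entry.created < created_from:
            continue
        if created_to is not None and entry.created > created_to:
            continue
        tapes.add(entry.tape_id)
    return sorted(tapes)


catalog = [
    CatalogEntry("ledger_2015.csv", "finance", date(2015, 3, 1), "csv", "TAPE-0007"),
    CatalogEntry("ledger_2016.csv", "finance", date(2016, 3, 1), "csv", "TAPE-0012"),
    CatalogEntry("contract.pdf", "legal", date(2015, 6, 15), "pdf", "TAPE-0007"),
    CatalogEntry("audit.pdf", "finance", date(2016, 9, 30), "pdf", "TAPE-0019"),
]

# Only the tapes holding finance data created from 2016 onwards need recalling.
print(find_tapes(catalog, owner="finance", created_from=date(2016, 1, 1)))
```

With an index like this, only the tapes that actually hold matching data have to be recalled from off-site storage and mounted.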