System Continuity


last change: 9-6-2023

System continuity is essential to our clients' business processes. Measures to ensure continuity include runtime solutions such as Blue/Green deployments, failover configurations and backup strategies.

Because we use two environments that differ greatly in nature, our System Continuity measures differ as well. Our classic product offering (shipitsmarter.com) is mainly hosted on our own hardware in an Equinix DataCenter, whereas our cloud native offering (viya.me) runs in a cloud-based Kubernetes solution. For our cloud native solutions we use the system continuity options of the cloud service provider to guarantee backups.

Restoring basic functionality

In case of function failure, restoring basic functionality is executed according to our Service Level Agreement:

Restoring complete functionality

In case of function failure, restoring complete functionality is executed according to our Service Level Agreement:

Backup strategy

All backups are created and stored at the system location (DataCenter) and are replicated to another physical location within 1 hour. Both the backup and the transport of data are encrypted according to industry standards.

Transactional data

Backup of transactional data in our DataCenter is provided according to the following scheme:

A complete backup every week.
A differential backup every day.
A transaction log update every 15 minutes.
A monthly restore to Acceptance to confirm backup health for database backups.
A retention of one month for weekly backup files after a confirmed restore, plus 14 one-out-of-four weekly backups.
A synchronisation of backups to an offsite location (ShipitSmarter Office), onto an encrypted, secured device, transported over a dedicated VPN tunnel.

For our cloud native solution the backup scheme will be equivalent, depending on the service provider's capabilities.
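To illustrate how a weekly full backup, daily differentials and 15-minute transaction logs combine during a restore, here is a minimal sketch in Python. The `Backup` class and the `restore_chain` function are hypothetical names introduced for this example; they do not describe the actual backup tooling, and the sketch assumes each differential is taken relative to the most recent full backup.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Backup:
    kind: str            # "full", "diff" or "log" (hypothetical labels)
    taken_at: datetime


def restore_chain(backups, target):
    """Return the backups to restore, in order, to get as close to `target` as possible."""
    usable = sorted((b for b in backups if b.taken_at <= target),
                    key=lambda b: b.taken_at)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup available before the target time")
    full = fulls[-1]
    # Only the latest differential after the chosen full backup is needed.
    diffs = [b for b in usable if b.kind == "diff" and b.taken_at > full.taken_at]
    base = diffs[-1] if diffs else full
    # Every transaction log after the base must be replayed in sequence.
    logs = [b for b in usable if b.kind == "log" and b.taken_at > base.taken_at]
    return [full] + diffs[-1:] + logs


# One week following the scheme above: weekly full, daily diff, log every 15 minutes.
start = datetime(2023, 6, 4)
backups = [Backup("full", start)]
backups += [Backup("diff", start + timedelta(days=d)) for d in range(1, 7)]
backups += [Backup("log", start + timedelta(minutes=15 * i)) for i in range(1, 7 * 96)]

chain = restore_chain(backups, datetime(2023, 6, 6, 10, 40))
print([b.kind for b in chain[:3]], chain[-1].taken_at)
```

The chain always starts with one full backup, adds at most one differential, and then replays only the logs taken after that differential, which keeps the number of restore steps small.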

Virtual Machines

All virtual machines, including production, are subject to a daily backup plan with a minimum retention of 3 days. The goal is to use a journaling backup strategy by January 1st 2022 at the latest, in which point-in-time snapshots provide backup and restore functionality that can mitigate ransomware attacks.

Other

Creatable or calculated data will be backed up according to a scenario that suits the Service Level Agreement.

Runtime continuity

During runtime, the infrastructure is defined in such a way that in case of a system failure an unattended failover mechanism will become active, taking over the execution of functionality while no downtime is experienced by the users.

Downtime during software deployments will be minimized by using a Blue/Green strategy for all software components where possible; only in very specific situations can downtime not be avoided.
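The Blue/Green idea can be sketched as follows. This is an illustrative Python model only: the `BlueGreenRouter` class and the `health_check` callback are hypothetical and do not represent the actual deployment pipeline. The point it shows is that the new version is installed on the idle side and traffic only switches after that side passes its check.

```python
class BlueGreenRouter:
    """Toy model of two identical environments with one receiving live traffic."""

    def __init__(self, live="blue"):
        self.versions = {"blue": None, "green": None}
        self.live = live

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version, health_check):
        """Install `version` on the idle side and switch traffic only if it is healthy."""
        target = self.idle
        self.versions[target] = version
        if not health_check(target, version):
            return False          # live side untouched, users see no change
        self.live = target        # the switch itself is the only user-visible step
        return True


router = BlueGreenRouter()
router.versions["blue"] = "1.0"
router.deploy("1.1", health_check=lambda side, version: True)
print(router.live, router.versions[router.live])
```

Because the previous version stays installed on the other side, rolling back amounts to switching traffic back.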

Disaster Recovery

In case of a major disaster, recovery can be guaranteed within 48 hours for an Equinix datacentre disaster and within 12 hours for cloud-based infrastructure, provided that the cloud provider (Microsoft Azure) is not experiencing a major service problem.

To ensure restoring infrastructure works flawlessly, all deployments should incorporate the creation of infrastructure needed to recreate the service, when possible.

RTO

Recovery Time Objective (RTO) refers to the amount of time that an application, system and/or process can be down without causing significant damage to the business, as well as the time spent restoring the application and its data.

In case of a disaster, a production environment able to run the most critical components will be created within 48 hours. To comply with this RTO, we strive for an RTO of 4 hours. Recovery will be executed sequentially over 4 priority groups. The RTO applies only to the highest priority group (critical production systems).
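As a rough illustration of sequential recovery over priority groups, the Python sketch below adds up estimated restore durations and checks the first group against the 4-hour target and the 48-hour guarantee. The function names and the example durations are assumptions made for this example; real durations would come from restore tests.

```python
from datetime import timedelta

TARGET_RTO = timedelta(hours=4)        # internal target named in the text
GUARANTEED_RTO = timedelta(hours=48)   # guaranteed recovery time named in the text


def recovery_timeline(durations):
    """Cumulative completion time per group when groups are restored one after another."""
    elapsed = timedelta()
    timeline = []
    for duration in durations:
        elapsed += duration
        timeline.append(elapsed)
    return timeline


def rto_status(durations):
    """Classify the completion time of the highest priority group (the first entry)."""
    first_group_done = recovery_timeline(durations)[0]
    if first_group_done <= TARGET_RTO:
        return "within target"
    if first_group_done <= GUARANTEED_RTO:
        return "within guarantee"
    return "breach"


# Hypothetical estimates for the 4 priority groups, highest priority first.
estimates = [timedelta(hours=h) for h in (3, 6, 12, 24)]
print(recovery_timeline(estimates), rto_status(estimates))
```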

RPO

Recovery Point Objective (RPO) refers to the amount of data that can be lost, within a period most relevant to a business, before significant harm occurs, measured from the point of a critical event back to the most recent preceding backup.

The maximum amount of data that can be lost is defined in time: as we have a very short backup cycle of 15 minutes, the RPO is set to 30 minutes, meaning that in case of a disaster, the amount of data lost is the data that has been handled in the last 30 minutes or less.
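The relation between the 15-minute backup cycle and the 30-minute RPO can be checked with simple arithmetic. The Python sketch below computes the data-loss window as the time between a failure and the last transaction log backup before it. The function name and the example timestamps are hypothetical; the second case assumes that one backup cycle was missed, which is one possible reading of why the RPO is set to twice the cycle length.

```python
from datetime import datetime, timedelta

RPO = timedelta(minutes=30)   # value stated in the text


def data_loss_window(failure_time, log_backup_times):
    """Time span of data lost: from the last log backup before the failure to the failure."""
    earlier = [t for t in log_backup_times if t <= failure_time]
    if not earlier:
        raise ValueError("no log backup precedes the failure")
    return failure_time - max(earlier)


day = datetime(2023, 6, 9)
failure = day.replace(hour=10, minute=44)

# Normal case: every 15-minute cycle succeeded.
logs = [day.replace(hour=10, minute=m) for m in (0, 15, 30)]
print(data_loss_window(failure, logs) <= RPO)

# Assumed case: the 10:30 cycle was missed, so the last backup is from 10:15.
logs_with_gap = [day.replace(hour=10, minute=m) for m in (0, 15)]
print(data_loss_window(failure, logs_with_gap) <= RPO)
```

In both cases the window stays below 30 minutes: 14 minutes when all cycles succeed and 29 minutes when a single cycle is missed.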

published on: 9-6-2023
