The Bare Minimum

The simplest storage solution is a single node with a single drive that users can read from and write to.

The biggest problem with this setup is that a single disk might not be enough to hold all of the users' data.


Adding more disks

Adding more disks within the node helps with handling large amounts of data.

There are two ways to store data when we have more than one disk:

  1. We can store each piece of data whole on a single disk.
  2. We can split each piece of data between disks (first four bytes on one disk, next four on the second, etc.).

The advantage of splitting data between disks is that read and write speeds scale roughly in proportion to the split factor, because the disks work in parallel. The downside is that if even a single disk fails, the data on all the other disks becomes useless, since every piece of data is now missing a part.
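The splitting scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "disks" are in-memory byte strings, and the names `stripe` and `unstripe` are hypothetical helpers, not part of any real storage API. The four-byte chunk size follows the example in the list above.

```python
def stripe(data: bytes, num_disks: int, chunk_size: int = 4) -> list[bytes]:
    """Distribute data round-robin across disks in fixed-size chunks."""
    disks = [bytearray() for _ in range(num_disks)]
    for i in range(0, len(data), chunk_size):
        # Chunk 0 goes to disk 0, chunk 1 to disk 1, and so on, wrapping around.
        disks[(i // chunk_size) % num_disks] += data[i:i + chunk_size]
    return [bytes(d) for d in disks]


def unstripe(disks: list[bytes], chunk_size: int = 4) -> bytes:
    """Reassemble the data by reading chunks back in round-robin order."""
    out = bytearray()
    offsets = [0] * len(disks)
    total = sum(len(d) for d in disks)
    disk = 0
    while len(out) < total:
        out += disks[disk][offsets[disk]:offsets[disk] + chunk_size]
        offsets[disk] += chunk_size
        disk = (disk + 1) % len(disks)
    return bytes(out)


data = b"ABCDEFGHIJKLMNOP"
disks = stripe(data, num_disks=2)
restored = unstripe(disks)  # equals data while every disk is intact
```

If one entry of `disks` is replaced with an empty byte string to simulate a failure, `unstripe` can no longer return the original data, which is the weakness described above.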


Adding replicas

Adding replica disks helps with restoring failed disks: the contents of a failed disk can be copied back from its replica.

Another advantage of adding mirrored disks is increased read performance, since a read can be served by either copy.

Although every write has to go to two disks, write performance stays about the same because the two writes happen in parallel.

The downside is that the node needs many more disks (twice as many, with one mirror per disk) without gaining any storage capacity.
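A minimal sketch of a mirrored pair, assuming each disk is modeled as an in-memory dictionary of blocks. The class `MirroredPair` and its methods are hypothetical names chosen for illustration, and the writes here run one after the other, whereas a real system would issue them in parallel.

```python
class MirroredPair:
    """Two in-memory 'disks' that hold identical copies of every block."""

    def __init__(self) -> None:
        self.disks: list[dict[int, bytes]] = [{}, {}]
        self._next_read = 0

    def write(self, block_id: int, data: bytes) -> None:
        # Every write goes to both disks.
        for disk in self.disks:
            disk[block_id] = data

    def read(self, block_id: int) -> bytes:
        # Alternate reads between the two copies to spread the load.
        disk = self.disks[self._next_read]
        self._next_read = 1 - self._next_read
        return disk[block_id]

    def rebuild(self, failed: int) -> None:
        # Replace the failed disk with a copy of the surviving one.
        self.disks[failed] = dict(self.disks[1 - failed])


pair = MirroredPair()
pair.write(0, b"hello")
pair.disks[0] = {}       # simulate losing disk 0
pair.rebuild(failed=0)   # disk 0 is restored from disk 1
```

The sketch also makes the capacity cost visible: two disks are in use, but only one disk's worth of data can be stored.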


Parity disks

Parity disks store a value computed from all the data disks in the system. That way, if a subset of the disks fails, we can use the remaining disks to recompute the lost data.

Example: if we have a node with 17 data disks and 3 parity disks, we can handle simultaneous failures of up to 3 disks.
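The simplest parity calculation is a byte-wise XOR, sketched below. Note the assumption: a single XOR parity block can only recover one lost disk. Tolerating three simultaneous failures, as in the 17 + 3 example, needs a stronger erasure code such as Reed-Solomon, which this sketch does not implement. The function names `xor_blocks` and `recover` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(surviving: list[bytes], parity: bytes) -> bytes:
    """Recompute the one missing data block from the survivors and the parity."""
    return xor_blocks(surviving + [parity])


data_disks = [b"disk", b"one!", b"two?"]
parity_disk = xor_blocks(data_disks)

# Simulate losing the second disk and rebuilding it from the rest.
rebuilt = recover([data_disks[0], data_disks[2]], parity_disk)
```

This works because XOR is its own inverse: XOR-ing the parity with every surviving block cancels them out and leaves only the missing block. Unlike mirroring, the overhead is one extra disk for the whole group rather than one per data disk.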


Sharding

The single-node systems discussed above can only hold as much data as the maximum number of disks that can be attached to a node allows.

Sharding splits the data across nodes to overcome this hardware limitation.
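One common way to decide which node holds which piece of data is to hash a key and take the result modulo the number of nodes. The sketch below assumes that placement scheme, which is only one option among several, and models each node as an in-memory dictionary. `shard_for` and `ShardedStore` are hypothetical names.

```python
import hashlib


def shard_for(key: str, num_nodes: int) -> int:
    """Map a key to a node index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class ShardedStore:
    """Key-value store whose entries are spread across several nodes."""

    def __init__(self, num_nodes: int) -> None:
        self.nodes: list[dict[str, bytes]] = [{} for _ in range(num_nodes)]

    def put(self, key: str, value: bytes) -> None:
        self.nodes[shard_for(key, len(self.nodes))][key] = value

    def get(self, key: str) -> bytes:
        return self.nodes[shard_for(key, len(self.nodes))][key]


store = ShardedStore(num_nodes=4)
store.put("photo-123", b"...image bytes...")
value = store.get("photo-123")
```

Each key lives on exactly one node, so total capacity grows with the number of nodes, but losing a node loses every key that hashed to it.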

But what if an entire node fails? Or what if multiple nodes fail?


Sharding with replicas


Sharding with parity nodes

We can have nodes dedicated to storing the data’s parity so that when an entire node fails, we can restore it from the parity calculations.

Refs