The Bare Minimum
The simplest storage solution is a single node with a single drive that users can read from and write to.
The biggest problem with this setup is that a single disk might not be enough to hold all of the users' data.
Adding more disks
Adding more disks within the node helps with handling large amounts of data.
There are two ways to store data when we have more than one disk:
- We can store each piece of data whole on a single disk
- We can split each piece of data across disks (first 4 bytes on one disk, next 4 on the second, etc.)
The advantage of splitting data across disks is that read and write speeds scale roughly with the split factor, because the disks work in parallel. The issue, however, is that if even a single disk fails, every piece of data loses a fragment, so the data on all the other disks becomes useless.
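The splitting scheme above can be sketched in a few lines of Python. This is only an illustration: the disks are plain in-memory byte strings, and the names `CHUNK`, `stripe`, and `unstripe` are made up for this example rather than taken from any real storage system.

```python
CHUNK = 4  # bytes per stripe unit, matching the "4 bytes per disk" example


def stripe(data: bytes, num_disks: int) -> list:
    """Distribute data round-robin across num_disks in CHUNK-byte units."""
    disks = [bytearray() for _ in range(num_disks)]
    for i in range(0, len(data), CHUNK):
        disks[(i // CHUNK) % num_disks] += data[i:i + CHUNK]
    return [bytes(d) for d in disks]


def unstripe(disks: list, total_len: int) -> bytes:
    """Reassemble the original data; only correct if every disk is intact."""
    n = len(disks)
    out = bytearray()
    num_chunks = (total_len + CHUNK - 1) // CHUNK
    for k in range(num_chunks):
        offset = (k // n) * CHUNK  # position of chunk k inside its disk
        out += disks[k % n][offset:offset + CHUNK]
    return bytes(out)


data = b"abcdefghijklmnop"
disks = stripe(data, 2)  # [b"abcdijkl", b"efghmnop"]
restored = unstripe(disks, len(data))
```

If any entry in `disks` is lost, `unstripe` can no longer return the original bytes, which is the failure mode described above.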
Adding replicas
Adding replica (mirrored) disks lets us restore a failed disk from its copy.
Another advantage of adding mirrored disks is increased read performance, since either copy can serve a read.
Although we have to write to two disks whenever we write data, write performance stays roughly the same because both writes happen in parallel.
But this means that we need many more disks within the node without gaining any storage capacity.
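A rough sketch of mirroring, assuming each disk is modeled as an in-memory dictionary of blocks. The class `MirroredDisks` and the helper `usable_capacity` are hypothetical names for this illustration, not part of any real driver.

```python
class MirroredDisks:
    """Toy in-memory model of mirrored disks (hypothetical, not a real driver)."""

    def __init__(self, copies=2):
        # Each "disk" maps block_id -> bytes; None marks a failed disk.
        self.disks = [dict() for _ in range(copies)]

    def write(self, block_id, data):
        # Every write goes to every healthy replica.
        for disk in self.disks:
            if disk is not None:
                disk[block_id] = data

    def read(self, block_id):
        # Any healthy replica can serve the read.
        for disk in self.disks:
            if disk is not None:
                return disk[block_id]
        raise IOError("all replicas failed")

    def fail(self, index):
        self.disks[index] = None

    def rebuild(self, index):
        # Restore a failed disk by copying from a healthy replica.
        source = next(d for d in self.disks if d is not None)
        self.disks[index] = dict(source)


def usable_capacity(disk_size_tb, num_disks, copies=2):
    """Capacity left for user data when every block is stored `copies` times."""
    return disk_size_tb * num_disks / copies
```

For example, `usable_capacity(4, 6)` gives 12.0: six 4 TB disks in mirrored pairs hold only 12 TB of user data, which is the capacity cost mentioned above.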
Parity disks
Parity disks store a value computed from the data on all the data disks in the system. That way, if a subset of the disks fails, we can use the remaining disks to recompute the lost data.
Ex: If we have a node with 17 data disks and 3 parity disks, we can handle simultaneous failures of up to 3 disks.
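To make the idea concrete, here is a minimal sketch of the simplest parity scheme: a single parity disk computed with XOR. Note that this is a simplification. One XOR parity disk can only recover from one failed disk; tolerating 3 failures as in the 17 + 3 example needs a more general erasure code (such as Reed-Solomon), which is not shown here. The function name `xor_blocks` and the sample bytes are made up for the illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data disks, each holding one 4-byte block (sample values).
data_disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The parity disk stores the XOR of all data disks.
parity = xor_blocks(data_disks)

# Simulate losing disk 1: XOR the surviving data disks with the parity disk.
survivors = [data_disks[0], data_disks[2], parity]
recovered = xor_blocks(survivors)
```

Here `recovered` equals the contents of the lost disk, because XOR-ing a value into the total twice cancels it out.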
Sharding
The single-node systems discussed above can only hold as much data as the maximum number of disks that can be attached to a node allows.
Sharding splits the data across nodes to overcome this hardware limitation.
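One common way to decide which node holds a given piece of data is to hash its key. The sketch below assumes that approach; the notes above do not prescribe a specific placement scheme, and `shard_for` and `ShardedStore` are hypothetical names, with each node modeled as an in-memory dictionary.

```python
import hashlib


def shard_for(key: str, num_nodes: int) -> int:
    """Map a key to a node index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class ShardedStore:
    """Toy store that spreads keys across several in-memory "nodes"."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def put(self, key, value):
        self.nodes[shard_for(key, len(self.nodes))][key] = value

    def get(self, key):
        return self.nodes[shard_for(key, len(self.nodes))][key]


store = ShardedStore(num_nodes=4)
for name in ["alice", "bob", "carol", "dave"]:
    store.put(name, name.upper())
```

Each key lives on exactly one node, so total capacity grows with the number of nodes, but losing a node loses every key that hashed to it.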
But what if an entire node fails? Or what if multiple nodes fail?
Sharding with replicas
Sharding with parity nodes
We can dedicate some nodes to storing parity for the data, so that when an entire node fails, we can restore it from the parity calculations.