About Ceph

Reza Mohammadi
Feb 2, 2024


What is Ceph?

CRUSH: Controlled Replication Under Scalable Hashing.

Ceph eliminates the need for centralized metadata and can distribute the load across all the nodes in the cluster.

CRUSH calculates data placement algorithmically rather than relying on a central lookup table, so it can scale to hundreds of petabytes without the risk of bottlenecks and the associated single points of failure (SPoF).

Clients also form direct connections with the required OSDs: because each client computes object locations itself with CRUSH, using the cluster maps obtained from the monitors, no central server sits in the data path to become a bottleneck.

Ceph provides three main types of storage:

  • Block storage via the RADOS Block Device (RBD).
  • File storage via CephFS.
  • Object storage via the RADOS Gateway (RGW).

How does Ceph work?

The core storage layer in Ceph is the Reliable Autonomic Distributed Object Store (RADOS), which, as the name suggests, provides an object store on which the higher-level protocols are built.

The RADOS layer in Ceph consists of a number of object storage daemons (OSDs).

Each OSD is completely independent and forms peer-to-peer relationships to form a cluster.

Each OSD is typically mapped to a single disk.

The monitors are responsible for forming a cluster quorum via the use of Paxos.

The monitors are not directly involved in the data path and do not have the same performance requirements as OSDs. They are mainly used to provide a known cluster state, including membership, via the use of various cluster maps.

Cluster maps are used by both Ceph cluster components and clients to describe the cluster topology and enable data to be safely stored in the right location.

The manager is responsible for configuration and statistics.

An algorithm called CRUSH is used to place the placement groups onto the OSDs.

Librados is a Ceph library that can be used to build applications that interact directly with the RADOS cluster to store and retrieve objects.
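
As a minimal sketch of what using librados looks like from Python (the pool name, object name, and config path are assumptions, not anything Ceph prescribes):

import rados

# Connect to the cluster; ceph.conf supplies the monitor addresses and keyring.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Open an I/O context on an existing pool and store/retrieve an object directly,
# bypassing the higher-level RBD, CephFS, and RGW protocols.
ioctx = cluster.open_ioctx('mypool')      # 'mypool' is a hypothetical pool
ioctx.write_full('greeting', b'hello rados')
print(ioctx.read('greeting'))             # b'hello rados'

ioctx.close()
cluster.shutdown()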

Object Storage

The idea behind object storage is relatively low-cost, modest-performance storage designed to serve Internet workloads: web applications, website hosting, and content delivery across the Internet, as well as a new home for all of the data we used to store on tape.

What is an Object?

The concept of object storage starts with the object, which can be any kind of file. Four essential components make every object usable in the computing sense (a small sketch follows the list):

  • ID: a unique identifier that lets us know which object we are dealing with when it comes time to retrieve it.
  • Data: the payload itself, such as a spreadsheet or a video.
  • Metadata: everything you need to know about the file and the data itself (who created it? when was it created? how large is it?).
  • Attributes: closely related to metadata, but not quite the same. Attributes describe the object itself rather than the data, for example which users are allowed to overwrite, download, or delete it.
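
As a rough sketch of how these four pieces map onto RADOS objects through the Python bindings (pool and object names are hypothetical): the ID is the object name, the data is the payload, and metadata and attributes can be stored as extended attributes.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')      # hypothetical pool

# ID and data: the object name identifies it, write_full stores the payload.
ioctx.write_full('report-2024-02', b'...spreadsheet bytes...')

# Metadata and attributes: stored as extended attributes (xattrs) on the object.
ioctx.set_xattr('report-2024-02', 'owner', b'alice')
ioctx.set_xattr('report-2024-02', 'created', b'2024-02-02')
print(ioctx.get_xattr('report-2024-02', 'owner'))    # b'alice'

ioctx.close()
cluster.shutdown()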

Buckets

All objects go into buckets. These buckets can be as big as you need them to be.

They can scale to hold billions of objects.

You will interact with buckets via an API.

A bucket is virtual: when you put an object into it and the bucket is backed by, say, three physically separate devices, a copy of the object is replicated out to all three devices. The purpose of this replication is data integrity and durability.
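
Because the RADOS Gateway exposes an S3-compatible API, interacting with buckets typically looks like the sketch below (the endpoint and credentials are placeholders, assuming a boto3 client pointed at an RGW instance):

import boto3

# Point an S3 client at the RADOS Gateway endpoint (values are placeholders).
s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:8080',
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

# Create a bucket, upload an object with user metadata, and read it back.
s3.create_bucket(Bucket='my-bucket')
s3.put_object(Bucket='my-bucket', Key='report.xlsx',
              Body=b'...spreadsheet bytes...',
              Metadata={'owner': 'alice'})
obj = s3.get_object(Bucket='my-bucket', Key='report.xlsx')
print(obj['Metadata'])                    # {'owner': 'alice'}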

What is BlueStore?

BlueStore is a Ceph object store that’s primarily designed to address the limitations of filestore.

Block + NewStore = BlewStore = BlueStore

NewStore was a combination of RocksDB, a key-value store that stored metadata, and a standard Portable Operating System Interface (POSIX) filesystem for the actual objects.

BlueStore is designed to remove the double write penalty associated with filestore and improve performance that can be obtained from the same hardware. Also, with the new ability to have more control over the way objects are stored on disk, additional features, such as checksums and compression, can be implemented.

How does BlueStore work?

Unlike filestore, data is written directly to the block device, while metadata operations are handled by RocksDB.

The block device is divided between RocksDB data storage and the actual user data stored in Ceph. Each object is stored as a number of blobs allocated from the block device.

RocksDB contains metadata for each object and tracks the utilization and allocation information for the data blobs.

RocksDB

RocksDB is a high-performance key-value store that was originally forked from LevelDB.

RocksDB is used to store metadata about the stored object, which was previously handled using a combination of LevelDB and XATTRs in filestore.

A key characteristic of RocksDB is the way data is written down through the levels of the database. New data is written into a memory-based table, with an optional transaction log on persistent storage, the WAL; as this memory-based table fills up, data is moved down to the next level of the database by a process called compaction. When that level fills up, data is migrated down again, and so on. All of these levels are stored in what RocksDB calls SST files. In Ceph, each of these levels is configured to be 10 times the size of the previous level.

All new data is written into the memory-based table and the WAL; the memory-based table is known as level 0. BlueStore configures level 0 as 256 MB. The default size multiplier between levels is a factor of ten, which means that level 1 is also 256 MB, level 2 is 2.56 GB, level 3 is 25.6 GB, and level 4 would be 256 GB. For most Ceph use cases, the average total metadata size per OSD should be around 20–30 GB, with the hot data set typically being smaller than this. The hope is that levels 0, 1, and 2 will contain most of the hot data for writes, so sizing an SSD partition to at least 3 GB should mean that these levels are stored on SSD.
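
The sizing arithmetic above can be sketched as follows (the numbers are the defaults quoted in the text, not values read from a live cluster):

# Levels 0 and 1 are 256 MB; each further level is 10x the size of the previous one.
base_mb, multiplier = 256, 10
levels_mb = [base_mb, base_mb]            # levels 0 and 1
for _ in range(3):                        # levels 2, 3 and 4
    levels_mb.append(levels_mb[-1] * multiplier)

for level, size_mb in enumerate(levels_mb):
    print(f"level {level}: {size_mb / 1000:.2f} GB")

# Levels 0-2 add up to roughly 3 GB, which is why an SSD partition of at least
# 3 GB keeps the hottest metadata levels on flash.
print(f"levels 0-2 total: {sum(levels_mb[:3]) / 1000:.2f} GB")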

RADOS Pools and Client Access

RADOS pools are the core part of a Ceph cluster. Creating a RADOS pool is what drives the creation and distribution of the placement groups, which themselves are the autonomous part of Ceph. Two types of pool can be created, replicated and erasure-coded, offering different usable capacities, durability, and performance. RADOS provides and manages different storage solutions to clients via RBD, CephFS, and RGW.

Replicated pools

Replicated RADOS pools are the default type in Ceph.
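
As a sketch, such a pool can be created and its replica count set through the monitor command interface (using the python-rados bindings; the pool name is hypothetical):

import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Create a replicated pool and set its replica count to 3.
for cmd in (
    {'prefix': 'osd pool create', 'pool': 'rep-pool', 'pg_num': 32},
    {'prefix': 'osd pool set', 'pool': 'rep-pool', 'var': 'size', 'val': '3'},
):
    ret, out, err = cluster.mon_command(json.dumps(cmd), b'')
    print(cmd['prefix'], '->', ret, err)

cluster.shutdown()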

Data is received by the primary OSD from the client and then replicated to the remaining OSDs.

On writes, the primary OSD receives the data from the client, forwards it to the replica OSDs, and waits for their acknowledgements before confirming the write, so that all copies stay in sync.

On reads, only the primary OSD answers the client's request.

Erasure code pools

Ceph’s default replication level provides excellent protection against data loss by storing three copies of your data on different OSDs. However, storing three copies of data vastly increases both the purchase cost of the hardware and the associated operational costs, such as power and cooling.

What is erasure coding?

Erasure coding allows Ceph to achieve either greater usable storage capacity or increased resilience to disk failure for the same number of disks.

Erasure coding achieves this by splitting up the object into a number of parts and then also calculating a type of CRC.

The erasure code stores data in separate parts and each part is stored on a separate OSD. These parts are referred to as K and M chunks, where K refers to the number of data shards and M refers to the number of erasure code shards.

Ceph can use the erasure codes to mathematically recreate the data from a combination of the remaining data and erasure code shards.
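
To make the idea concrete, here is a deliberately simplified sketch using a single XOR parity shard (K data shards plus one parity shard). Ceph's erasure-code plugins use Reed-Solomon-style codes that support several parity shards, but the recovery principle is the same: a lost shard can be rebuilt from the surviving ones.

def split(data: bytes, k: int) -> list:
    """Split data into k equally sized shards, padding with zero bytes."""
    shard_len = -(-len(data) // k)                     # ceiling division
    data = data.ljust(shard_len * k, b'\x00')
    return [data[i * shard_len:(i + 1) * shard_len] for i in range(k)]

def xor(shards: list) -> bytes:
    """XOR equally sized shards together, byte by byte."""
    out = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            out[i] ^= b
    return bytes(out)

data_shards = split(b'hello erasure coding', k=4)      # the K data chunks
parity = xor(data_shards)                              # a single parity (M) chunk

# Simulate losing one data shard and rebuilding it from the rest plus parity.
rebuilt = xor(data_shards[:2] + data_shards[3:] + [parity])
assert rebuilt == data_shards[2]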

A commonly recommended K+M is 4+2: this configuration gives you about 66% usable capacity (K / (K + M) = 4 / 6) and tolerates two OSD failures.
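
As a sketch of how such a 4+2 profile might be set up through the monitor command interface with the python-rados bindings (the profile and pool names are hypothetical):

import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Define a k=4, m=2 erasure-code profile and create a pool that uses it.
for cmd in (
    {'prefix': 'osd erasure-code-profile set', 'name': 'ec-4-2',
     'profile': ['k=4', 'm=2', 'crush-failure-domain=host']},
    {'prefix': 'osd pool create', 'pool': 'ec-pool', 'pg_num': 32,
     'pool_type': 'erasure', 'erasure_code_profile': 'ec-4-2'},
):
    ret, out, err = cluster.mon_command(json.dumps(cmd), b'')
    print(cmd['prefix'], '->', ret, err)

cluster.shutdown()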

Reading back from these high-chunk pools is also a problem. Unlike in a replica pool, where Ceph can read just the requested data from any offset in an object, in an erasure pool, all shards from all OSDs have to be read before the read request can be satisfied.

How does erasure coding work in Ceph?

Erasure-coded pools also have the concept of a primary OSD.

The primary OSD has the responsibility of communicating with the client, calculating the erasure shards, and sending them out to the remaining OSDs in the PG set.

If an OSD in the set is down, the primary OSD can use the remaining data and erasure shards to reconstruct the data before sending it back to the client. During read operations, the primary OSD requests all OSDs in the PG set to send their shards. The primary OSD uses the data shards to construct the requested data, and the erasure shards are discarded. There is a fast read option that can be enabled on erasure pools, which allows the primary OSD to reconstruct the data from erasure shards if they return quicker than the data shards. This can help to lower average latency at the cost of slightly higher CPU usage.

When one of the data shards is unavailable, the data is reconstructed by reversing the erasure algorithm, using the remaining data and erasure shards.

Placement Groups

Placement groups (PGs) are an internal implementation detail of how Ceph distributes data.

Autoscaling provides a way to manage PGs, and especially to manage the number of PGs present in different pools. When pg-autoscaling is enabled, the cluster is allowed to make recommendations or automatic changes with respect to the number of PGs for each pool (pgp_num) in accordance with expected cluster utilization and expected pool utilization.

Each pool has a pg_autoscale_mode property that can be set to off, on, or warn:

  • off: Disable autoscaling for this pool.
  • on: Enable automated changes of the PG count for the given pool.
  • warn: Raise health checks when the PG count is in need of change.
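
Switching the mode is a single pool property change; below is a sketch using the python-rados monitor command interface (the pool name is hypothetical):

import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Turn the PG autoscaler on for one pool.
cmd = {'prefix': 'osd pool set', 'pool': 'rep-pool',
       'var': 'pg_autoscale_mode', 'val': 'on'}
ret, out, err = cluster.mon_command(json.dumps(cmd), b'')
print(ret, err)

cluster.shutdown()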

OSD

Ceph-OSD is the object storage daemon for the Ceph distributed file system. It is responsible for storing objects on a local file system and providing access to them over the network. It manages data on local storage with redundancy and provides access to that data over the network.

The datapath argument to ceph-osd should be a directory on an XFS file system where the object data resides.

CRUSH

The CRUSH algorithm computes storage locations to determine how to store and retrieve data.

CRUSH allows Ceph clients to communicate with OSDs directly rather than through a centralized server or broker.

CRUSH uses a map of the cluster (the CRUSH map) to map data to OSDs. https://ceph.io/assets/pdfs/weil-crush-sc06.pdf

CRUSH maps contain a list of OSDs and a hierarchy of “buckets” (hosts, racks) and rules that govern how CRUSH replicates data within the cluster’s pools.

CRUSH's placement decisions are based on the failure domain. If the failure domain is set to osd, CRUSH will not place a PG's replicas on the same OSD; if the failure domain is set to host, it will not place them on the same host but distributes them across different hosts, since its job is to eliminate single points of failure.
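
For example, a replicated CRUSH rule whose failure domain is the host can be created as in the sketch below (via the monitor command interface; the rule name and the default root are assumptions based on a standard CRUSH hierarchy):

import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Create a replicated rule that places each copy under a different host.
cmd = {'prefix': 'osd crush rule create-replicated',
       'name': 'replicated-by-host', 'root': 'default', 'type': 'host'}
ret, out, err = cluster.mon_command(json.dumps(cmd), b'')
print(ret, err)

cluster.shutdown()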

When OSDs are deployed, they are automatically added to the CRUSH map under a host bucket that is named for the node on which the OSDs run. This behavior, combined with the configured CRUSH failure domain, ensures that replicas or erasure-code shards are distributed across hosts and that the failure of a single host or other kinds of failures will not affect availability. For larger clusters, administrators must carefully consider their choice of failure domain. For example, distributing replicas across racks is typical for mid- to large-sized clusters.

The CRUSH location for an OSD can be modified by adding the crush location option in ceph.conf. When this option has been added, every time the OSD starts it verifies that it is in the correct location in the CRUSH map and moves itself if it is not. To disable this automatic CRUSH map management, add the following to the ceph.conf configuration file in the [osd] section:

osd crush update on start = false

The cluster name is typically ceph, the id is the daemon identifier or (in the case of OSDs) the OSD number, and the daemon type is osd, mds, mgr, or mon.

“Bucket”, in the context of CRUSH, is a term for any of the internal nodes in the hierarchy: hosts, racks, rows, and so on. The CRUSH map defines a series of types that are used to identify these nodes.

Each node (device or bucket) in the hierarchy has a weight that indicates the relative proportion of the total data that should be stored by that device or hierarchy subtree.

The weight of the root node will be the sum of the weights of all devices contained under it. Weights are typically measured in tebibytes (TiB).

Rules

CRUSH rules define policy governing how data is distributed across the devices in the hierarchy. The rules define placement as well as replication strategies or distribution policies that allow you to specify exactly how CRUSH places data replicas.

https://docs.ceph.com/en/quincy/rados/operations/crush-map/#crush-maps

Ceph-MON

Ceph-mon is the cluster monitor daemon for the Ceph distributed file system. One or more instances of ceph-mon form a Paxos part-time parliament cluster that provides extremely reliable and durable storage of cluster membership, configuration, and state.

Ceph-MGR

The Ceph Manager daemon (ceph-mgr) runs alongside the monitor daemon, to provide additional monitoring and interfaces to external monitoring and management systems.

By default, the manager daemon requires no additional configuration, beyond ensuring it is running. If there is no mgr daemon running, you will see a health warning to that effect, and some of the other information in the output of Ceph status will be missing or stale until a mgr is started.

Use your normal deployment tools, such as ceph-ansible or cephadm, to set up ceph-mgr daemons on each of your mon nodes. It is not mandatory to place mgr daemons on the same nodes as mons, but it is almost always sensible.

RGW

Cephadm deploys radosgw as a collection of daemons that manage a single-cluster deployment or a particular realm and zone in a multisite deployment. https://docs.ceph.com/en/latest/radosgw/multisite/#multisite

The RGW module provides a simple interface for deploying RGW multisite. It helps with bootstrapping and configuring the RGW realm, zonegroup, and the other related entities.

RBD Mirroring

RBD images can be asynchronously mirrored between two Ceph clusters. This capability is available in two modes:

  • Journal-based
  • Snapshot-based

Journal-based

This mode uses the RBD journaling image feature to ensure point-in-time, crash-consistent replication between clusters. Every write to the RBD image is first recorded to the associated journal before modifying the actual image. The remote cluster will read from this associated journal and replay the updates to its local copy of the image. Since each write to the RBD image will result in two writes to the Ceph cluster, expect write latencies to nearly double while using the RBD journaling image feature.

Snapshot-based

This mode uses periodically scheduled or manually created RBD image mirror snapshots to replicate crash-consistent RBD images between clusters. The remote cluster will determine any data or metadata updates between two mirror snapshots and copy the deltas to its local copy of the image. With the help of the RBD fast-diff image feature, updated data blocks can be quickly determined without the need to scan the full RBD image. Since this mode is not as fine-grained as journaling, the complete delta between two snapshots will need to be synced prior to use during a failover scenario. Any partially applied set of deltas will be rolled back at the moment of failover.

Mirroring can be configured per pool or for a subset of RBD images. The rbd-mirror daemon is responsible for pulling image updates from the remote cluster and applying them to the local cluster.

Depending on the desired needs for replication, RBD mirroring can be configured for either one-way or two-way replication:

  • One-way Replication: When data is only mirrored from a primary cluster to a secondary cluster, the rbd-mirror daemon runs only on the secondary cluster. Data can also be replicated to more than one secondary site.
  • Two-way Replication: When data is mirrored from primary images on one cluster to non-primary images on another cluster (and vice versa), the rbd-mirror daemon runs on both clusters to facilitate replication in both directions. Changes made to primary images on one cluster are mirrored to the non-primary images on the other, allowing bidirectional synchronization of data between the two clusters.
