HarperDB Clustering Overview
This article outlines HarperDB clustering behavior and terminology. Clustering connects HarperDB instances into a network, which is used to create a distributed data management platform. Clusters of up to three nodes are available with the free version of HarperDB; larger Clusters require a registered version.
HarperDB clustering requires two or more installations of HarperDB. Additionally, established network connectivity is required for initial configuration of the node. Once nodes are added to the Cluster, network connectivity can be intermittent and HarperDB will automatically handle queueing of transactions and data flow.
A single instance/installation of HarperDB constitutes a node. A node of HarperDB can operate independently with clustering on or off. Each HarperDB node encapsulates the core HarperDB server as well as a WebSocket server which facilitates distributed communication between HarperDB nodes.
A group of two or more HarperDB nodes that have been connected via WebSockets becomes a Cluster. A connection establishes a pipe between two HarperDB nodes, but it does not define data movement.
Schemas in a Cluster
All schema metadata is automatically shared across nodes as they are added to a Cluster. In other words, schema, table, and attribute definitions are identical across a Cluster. Note that if HarperDB nodes have their own schema definitions before being added to the Cluster, the synchronization happens bidirectionally.
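As a rough illustration of what bidirectional synchronization implies, the sketch below merges the definitions of two nodes into one shared view. The nested dictionary layout and the function name `merge_schemas` are hypothetical, and the sketch assumes a plain union of definitions; this article does not describe how HarperDB resolves conflicting definitions.

```python
def merge_schemas(node_a: dict, node_b: dict) -> dict:
    """Return the union of two {schema: {table: {attributes}}} mappings.

    Hypothetical model: after the merge, both nodes would hold the
    same schema, table, and attribute definitions.
    """
    merged: dict = {}
    for definitions in (node_a, node_b):
        for schema, tables in definitions.items():
            for table, attributes in tables.items():
                # Create the schema/table entry on first sight, then add attributes.
                merged.setdefault(schema, {}).setdefault(table, set()).update(attributes)
    return merged


# Example: each node contributes definitions the other one lacks.
node_a = {"dev": {"dog": {"id", "name"}}}
node_b = {"dev": {"dog": {"id", "breed"}, "cat": {"id"}}}
shared = merge_schemas(node_a, node_b)
```

The inputs are left unmodified, so the same function can be applied in either direction with the same result.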
Clustering Users and Roles
Inter-node authentication takes place via HarperDB users. A special role type is defined for clustering users; it restricts the user to clustering functionality only.
HarperDB users and roles are independent per instance and are initially set when installing HarperDB. There is no limit on the number of clustering users; however, HarperDB-to-HarperDB clustering must use the same credentials. For Custom Pub/Sub Connectors, we recommend creating an integration-specific user on the HarperDB nodes that will interact with the integration.
Subscriptions are defined in node configuration and determine which data moves to which node. Subscriptions are exclusively table-level and operate independently of one another. For data to move fully across a large Cluster, a group of subscriptions across nodes must be created. Channels, Publish, and Subscribe are all settings within a Subscription; see the definitions below for more information about each. Subscriptions are also referred to as pub/sub functionality.
Channels are unique namespaces used exclusively for publishing to and subscribing to the related HarperDB tables through clustering. Channels use the naming convention schema:table, so each channel represents a single table within a schema.
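The schema:table convention can be captured in a small helper. The following is a minimal sketch; the `Channel` type and `parse_channel` function are hypothetical names introduced for illustration and are not part of HarperDB.

```python
from typing import NamedTuple


class Channel(NamedTuple):
    """A channel identifies exactly one table within one schema."""
    schema: str
    table: str

    @property
    def name(self) -> str:
        # Channel names follow the schema:table convention.
        return f"{self.schema}:{self.table}"


def parse_channel(name: str) -> Channel:
    """Split a 'schema:table' string into its two parts."""
    schema, sep, table = name.partition(":")
    if not sep or not schema or not table or ":" in table:
        raise ValueError(f"expected 'schema:table', got {name!r}")
    return Channel(schema, table)


channel = parse_channel("dev:dog")
print(channel.schema, channel.table, channel.name)
```

The schema `dev` and table `dog` are example values only.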
Publish is a unidirectional data flow that pushes table transactions from one HarperDB node to another across a clustering connection. Transactions are published only after they have completed on the local node.
Subscribe is a unidirectional data flow that listens for table transactions on another HarperDB node. When a transaction completes on the other node, it is sent to the subscriber node, where it is executed upon receipt.
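To make the two directions concrete, the sketch below models a subscription as a channel plus publish and subscribe flags, and derives which way transactions move between a local and a remote node. The `Subscription` class, the `flows` function, and the node names are hypothetical; this illustrates the concept and is not HarperDB's configuration API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    channel: str     # "schema:table"
    publish: bool    # push completed local transactions to the remote node
    subscribe: bool  # receive transactions completed on the remote node


def flows(local: str, remote: str, sub: Subscription) -> list[tuple[str, str]]:
    """Return the (source, destination) pairs implied by one subscription."""
    result = []
    if sub.publish:
        result.append((local, remote))
    if sub.subscribe:
        result.append((remote, local))
    return result


# Publish only: data moves one way, from node1 to node2.
one_way = Subscription("dev:dog", publish=True, subscribe=False)
print(flows("node1", "node2", one_way))  # [('node1', 'node2')]

# Publish and subscribe together give a two-way flow for the table.
two_way = Subscription("dev:dog", publish=True, subscribe=True)
print(flows("node1", "node2", two_way))  # [('node1', 'node2'), ('node2', 'node1')]
```

With both flags off, the connection still exists but no data moves, which matches the earlier point that a connection alone does not define data movement.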
HarperDB topology refers to the arrangement of nodes and connections within a Cluster. Topologies are highly flexible and are defined node-to-node through configuration functions.
Custom connectors can be created using WebSockets to interface with a HarperDB Cluster.
HarperDB has built-in resiliency for when network connectivity is lost within a subscription. When the connection is reestablished, a catchup routine runs to ensure that data missed during the outage, specific to the subscription, is sent or received as defined.
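The idea behind a catchup routine can be sketched as replaying, per channel, every transaction recorded after the last one the other node received. This article does not describe HarperDB's actual mechanism, so the timestamp-based replay, the `Transaction` and `TransactionLog` classes, and all field names below are assumptions made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Transaction:
    channel: str    # "schema:table" the transaction belongs to
    timestamp: int  # hypothetical monotonically increasing marker
    payload: dict


@dataclass
class TransactionLog:
    """Hypothetical per-node log of completed transactions."""
    entries: list[Transaction] = field(default_factory=list)

    def record(self, tx: Transaction) -> None:
        self.entries.append(tx)

    def catchup(self, channel: str, last_seen: int) -> list[Transaction]:
        """Transactions on one channel that a reconnecting node missed."""
        return [
            tx for tx in self.entries
            if tx.channel == channel and tx.timestamp > last_seen
        ]


log = TransactionLog()
log.record(Transaction("dev:dog", 1, {"id": 1}))
log.record(Transaction("dev:cat", 2, {"id": 7}))  # different subscription
log.record(Transaction("dev:dog", 3, {"id": 2}))

# The remote node last received timestamp 1 before the connection dropped,
# so only the later dev:dog transaction needs to be replayed.
missed = log.catchup("dev:dog", last_seen=1)
```

Filtering by channel reflects the statement that catchup is specific to the subscription: transactions on tables outside the subscription are not replayed.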