# UniConfig Clustering
## Introduction
UniConfig can be easily deployed in a cluster owing to its stateless architecture and transaction isolation:
- Stateless architecture - UniConfig nodes in the cluster do not keep any state that needs to be communicated directly to other UniConfig nodes in the cluster. All network-topology configuration and state information is stored inside a PostgreSQL database that must be reachable from all UniConfig nodes in the same zone. All UniConfig nodes share the same database, making the database the single source of truth for the cluster.
- Transaction isolation - Load-balancing is based on mapping UniConfig transactions to UniConfig nodes in a cluster (transactions are sticky). One UniConfig transaction cannot span multiple UniConfig nodes in a cluster. Southbound sessions used for device management are ephemeral, i.e., they are created when UniConfig needs to access a device on the network (like pushing configuration updates) and are closed as soon as a UniConfig transaction is committed or closed.
There are several advantages to clustered deployment of UniConfig nodes:
- Horizontal scalability - Increasing the number of units that can process UniConfig transactions in parallel. A single UniConfig node tends to have limited processing and networking resources - this constraint can be mitigated by increasing the number of nodes in the cluster. The more UniConfig nodes in the cluster, the more transactions can be executed in parallel. The number of connected UniConfig nodes in the cluster can also be adjusted at runtime.
- High availability - A single UniConfig node does not represent a single point of failure. If a UniConfig node crashes, only transactions processed by the corresponding node are cancelled. The application can retry a failed transaction, which will be processed by the next node in the cluster.
There are also a couple of limitations to consider:
- Parallel execution of transactions is subject to a locking mechanism, whereby two transactions cannot manipulate the same device at the same time.
- A single transaction is always executed by a single UniConfig node. This means that the scope of a single transaction is limited by the number of devices and their configurations that a single UniConfig node can handle.
## Deployments
### Single-zone deployment
In a single-zone deployment, all managed network devices are reachable by all UniConfig nodes in the cluster zone. Components of a single-zone deployment and connections between them are illustrated in the diagram below.
Included components:
- UniConfig controllers - Network controllers that use a common PostgreSQL system for data persistence, communicate with network devices using the NETCONF/gNMI/CLI management protocols, and propagate notifications to Kafka topics (UniConfig nodes act only as Kafka producers). UniConfig nodes do not communicate with each other directly; their operations can only be coordinated using data stored in the database.
- Database storage - PostgreSQL is used for persistence of network-topology configuration, mount-point settings and selected operational data. The PostgreSQL database can also be deployed in a cluster (out of scope for this document).
- Message and notification channels - The Kafka cluster is used to propagate notifications that are either generated by UniConfig itself (e.g., audit and transaction notifications) or received from network devices and forwarded by UniConfig controllers.
- Load-balancers - Load-balancers are used to distribute transactions (HTTP traffic) and SSH sessions from applications to UniConfig nodes. From the point of view of the load-balancer, all UniConfig nodes in the cluster are equal. Currently only a round-robin load-balancing strategy is supported.
- Managed network devices - Devices that are managed by UniConfig nodes using the NETCONF/gNMI/CLI protocols or that send notifications to UniConfig nodes. Sessions between UniConfig nodes and devices are either on-demand/ephemeral (device configuration) or long-lived (distribution of notifications over streams).
- HTTP/SSH clients and Kafka consumers - The application layer, such as workflow management systems or end-user systems. The RESTCONF API is exposed over HTTP, the SSH server exposes the UniConfig shell, and Kafka brokers allow Kafka consumers to listen for events on subscribed topics.
### Multi-zone deployment
This type of deployment has multiple zones that manage separate sets of devices for the following reasons:
- Network reachability issues - Groups of devices are reachable, and thus manageable, only from some part of the network (zone) but not from others.
- Logical separation - Different scaling strategies or requirements for different zones.
- Legal issues - Some devices must be managed separately with split storage because of, for example, regional restrictions.
The following diagram represents a sample deployment with two zones:
- Zone 1 contains three UniConfig nodes.
- Zone 2 contains only two UniConfig nodes.
Multiple zones can share a single Kafka cluster, but database instances must be split (they can still run on a single PostgreSQL server).
Description of multi-zone areas:
- Applications - The application layer is responsible for managing the mapping between network segments and UniConfig zones. Typically, this is achieved with an additional inventory database that contains device-to-zone mappings; based on this information, the application decides which zone to use.
- Isolated zones - A zone contains one or more UniConfig nodes, load-balancers and managed network devices. Clusters in isolated zones do not share information.
- PostgreSQL databases - It is necessary to use a dedicated database per zone.
- Kafka cluster - A Kafka cluster can be shared by multiple clusters in different zones, or alternatively there can be a single Kafka cluster per zone. Notifications from different zones can be safely pushed to common topics, as there can be no conflicts between Kafka publishers. However, it is also possible to achieve isolation of published messages in a shared Kafka deployment by setting different topic names in different zones.
## Load-balancer operation
Load-balancers are responsible for allocating UniConfig transactions to one of the UniConfig nodes in the cluster. This is done by forwarding requests that do not carry a UniConfig transaction header to one of the UniConfig nodes (using a round-robin strategy) and afterwards appending a backend identifier to the create-transaction RPC response in the form of an additional Cookie header (the 'sticky session' concept). Afterwards, the application is responsible for ensuring that all requests belonging to the same transaction contain the same backend identifier.
The application is responsible for preserving transaction and backend identifier cookies throughout the transaction's lifetime.
The following sequence diagram captures the process of creating and using two UniConfig transactions with a focus on load-balancer operation.
- The first create-transaction RPC is forwarded to the first UniConfig node (applying the round-robin strategy) because it does not contain the `uniconfig_server_id` key in the Cookie header. The response contains both the UniConfig transaction ID (`UNICONFIGTXID`) and `uniconfig_server_id`, which represents the 'sticky cookie'. The `uniconfig_server_id` cookie header is appended to the response by the load-balancer.
- The next request that belongs to the created transaction contains the same `UNICONFIGTXID` and `uniconfig_server_id`. The load-balancer uses `uniconfig_server_id` to forward this request to the correct UniConfig node.
- The last application request again represents the create-transaction RPC. This time, the request is forwarded to the next registered UniConfig node in the cluster according to the round-robin strategy.
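The same flow can be sketched with curl. This is only an illustration: the host, credentials and cookie values are placeholders, and the paths assume the default `/rests` RESTCONF prefix and the `uniconfig-manager:create-transaction` RPC.

```bash
# 1. Create a transaction - no cookies yet, so the load-balancer picks a node
#    round-robin and appends the sticky uniconfig_server_id cookie to the response.
curl -v -X POST "http://<load-balancer>:8181/rests/operations/uniconfig-manager:create-transaction" \
     -u <user>:<password>
#   < Set-Cookie: UNICONFIGTXID=<transaction-id>
#   < Set-Cookie: uniconfig_server_id=<node-id>

# 2. Any request within the same transaction sends both cookies back, so the
#    load-balancer forwards it to the same UniConfig node.
curl -X GET "http://<load-balancer>:8181/rests/data/network-topology:network-topology?content=config" \
     -u <user>:<password> \
     --cookie "UNICONFIGTXID=<transaction-id>; uniconfig_server_id=<node-id>"
```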
## Configuration
### UniConfig configuration
All UniConfig nodes in the cluster should be configured using the same parameters. There are several important sections in the `config/application.properties` file that relate to the clustered environment.
#### Database connection settings
This section contains information on how to connect to the PostgreSQL database, as well as connection pool settings. It is located under the `dbPersistence.connection` properties object.
Example with essential settings:
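The sketch below is illustrative only: host, credentials and pool-size values are placeholders, and the exact key names should be checked against the `application.properties` file shipped with your UniConfig version.

```properties
# Illustrative sketch - key names may differ between UniConfig versions.
# Connection to the shared PostgreSQL instance, reachable from all nodes in the zone.
dbPersistence.connection.dbName=uniconfig
dbPersistence.connection.username=uniremote
dbPersistence.connection.password=unipass
dbPersistence.connection.databaseLocations[0].host=postgres
dbPersistence.connection.databaseLocations[0].port=5432

# Connection pool - maxDbPoolSize also caps the number of concurrently
# open UniConfig transactions on this node.
dbPersistence.connection.initialDbPoolSize=5
dbPersistence.connection.maxDbPoolSize=20
```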
Make sure that [number of UniConfig nodes in cluster] * [`maxDbPoolSize`] does not exceed the maximum allowed number of open transactions and open connections on the PostgreSQL side. Note that `maxDbPoolSize` also caps the maximum number of open UniConfig transactions (1 UniConfig transaction == 1 database transaction == 1 database connection).
#### UniConfig node identification
By default, UniConfig node names are generated randomly. This behavior can be modified by setting `db-persistence.uniconfig-instance.instance-name`. The instance name is leveraged, for example, in the clustering of stream subscriptions.
Example:
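A minimal sketch (the value is a placeholder; each node in the cluster should be given a unique name):

```properties
# Stable, human-readable node identifier (placeholder value) - used, for example,
# when allocating stream subscriptions across the cluster.
db-persistence.uniconfig-instance.instance-name=uc-node-1
```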
#### Kafka and notification properties
This section contains properties related to connections to Kafka brokers, Kafka publisher timeouts, authentication, subscription allocation and rebalancing settings.
Example with essential properties:
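The sketch below is purely illustrative: the key names and values are assumptions and must be checked against the notification/Kafka properties in the `application.properties` file shipped with your UniConfig version; broker addresses, credentials and limits are environment-specific placeholders.

```properties
# Illustrative sketch only - key names and values are assumptions; consult the
# application.properties of your UniConfig release for the exact schema.

# Enable publishing of notifications and point all nodes at the shared Kafka brokers.
notifications.enabled=true
notifications.kafka.kafka-servers[0].broker-host=kafka
notifications.kafka.kafka-servers[0].broker-listening-port=9092

# Publisher timeout and authentication (placeholder values).
notifications.kafka.blocking-timeout=60000
notifications.kafka.username=kafka-user
notifications.kafka.password=kafka-password

# Subscription allocation and rebalancing: hard per-node limit, number of
# subscriptions acquired per monitoring iteration, and the iteration interval.
notifications.max-netconf-subscriptions=5000
notifications.max-netconf-subscriptions-per-interval=10
notifications.subscriptions-monitoring-interval=5
```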
### Load-balancer configuration
The following YAML code represents a sample Traefik configuration that can be used in a clustered UniConfig deployment (a deployment with a single Traefik node). There is one registered entry-point with the `uniconfig` identifier on port 8181.
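A minimal sketch of such a static configuration (the file name and the use of the Docker provider are assumptions based on the label-based setup described below):

```yaml
# traefik.yml - static configuration (illustrative)
entryPoints:
  uniconfig:
    address: ":8181"          # entry-point with the 'uniconfig' identifier on port 8181

providers:
  docker:
    exposedByDefault: false   # only containers with explicit Traefik labels are exposed
```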
Next, you need to configure UniConfig Docker containers with Traefik labels. UniConfig nodes are automatically detected by the Traefik container as providers of the `uniconfig` service. The labels also specify the URI prefix (`/rests`), the name of the 'sticky cookie' (`uniconfig_server_id`) and the server port number (`8181`) on which the UniConfig web server listens for incoming HTTP requests.
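A minimal docker-compose sketch of these labels (the service and image names are placeholders):

```yaml
services:
  uniconfig:
    image: uniconfig:latest                 # placeholder image name
    labels:
      - traefik.enable=true
      # Route requests arriving on the 'uniconfig' entry-point with the /rests prefix.
      - traefik.http.routers.uniconfig.entrypoints=uniconfig
      - traefik.http.routers.uniconfig.rule=PathPrefix(`/rests`)
      # Forward traffic to the UniConfig web server port and enable the sticky cookie.
      - traefik.http.services.uniconfig.loadbalancer.server.port=8181
      - traefik.http.services.uniconfig.loadbalancer.sticky.cookie.name=uniconfig_server_id
```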
Values for all Traefik labels should be the same on all nodes in the cluster. Scaling of the UniConfig service in the cluster (for example, using Docker Swarm tools) is simple when container settings do not change.
A configuration similar to the Traefik example above can also be achieved with other load-balancing tools, such as HAProxy.
## Clustering of NETCONF subscriptions and notifications
When a device is installed with the `stream` property set, subscriptions for all provided streams are created in the database. These subscriptions are always created with the UniConfig instance ID set to null, so they can be acquired by any UniConfig instance in the cluster.
Each UniConfig instance in the cluster uses its own monitoring system to acquire free subscriptions. The monitoring system uses a specialized transaction to lock subscriptions, which prevents other UniConfig instances from locking the same ones. While locking a subscription, the UniConfig instance writes its ID into the corresponding row of the subscription table, indicating that the subscription has been acquired by that instance. Other UniConfig instances then see that this subscription is no longer available.
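Conceptually, the acquisition step can be pictured with a PostgreSQL sketch like the one below; the table and column names are purely illustrative and do not reflect UniConfig's internal schema.

```sql
-- Conceptual sketch only: table/column names are illustrative, not UniConfig's real schema.
-- Acquire up to 10 free subscriptions and mark them with this node's instance ID.
BEGIN;

UPDATE stream_subscription
SET    uniconfig_instance_id = 'uc-node-1'      -- ID of the acquiring node
WHERE  id IN (
    SELECT id
    FROM   stream_subscription
    WHERE  uniconfig_instance_id IS NULL        -- free subscriptions only
    LIMIT  10
    FOR UPDATE SKIP LOCKED                      -- skip rows locked by other nodes
);

COMMIT;
```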
### Optimal subscription count and rebalancing
With multiple UniConfig instances working in a cluster, each instance calculates an optimal range of subscriptions to manage.
Based on the optimal range and number of currently opened subscriptions, each UniConfig node (while performing a monitoring system iteration) decides whether it should:
- Acquire additional subscriptions until the optimal range is reached.
- Once the optimal range is reached, stay put and not acquire any additional subscriptions.
- Release some of its subscriptions to trigger rebalancing until the optimal range is reached.
If an instance goes down, all of its subscriptions are immediately released and the optimal ranges of other living nodes are adjusted. Managed network devices, and thus subscriptions, are reopened by the rest of the cluster.
Note that there is a grace period before other nodes take over the subscriptions. Therefore, if a node goes down and then back up in quick succession, it will restart the subscriptions on its own.
The following example illustrates a timeline for a three-node cluster and how many subscriptions are handled by each node:
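As a purely hypothetical illustration, assume 300 subscriptions in total and a hard limit of 150 subscriptions per node:

| Time | Event | Node 1 | Node 2 | Node 3 |
|------|-------|--------|--------|--------|
| t0 | Only node 1 is running | 150 (hard limit) | - | - |
| t1 | Node 2 joins; optimal range is ~150 per node | 150 | 150 | - |
| t2 | Node 3 joins; optimal range drops to ~100 per node | 100 | 100 | 100 |
| t3 | Node 3 fails; after the grace period its subscriptions are redistributed | 150 | 150 | - |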
The hard limit still applies in clustered environments. It is never exceeded regardless of the optimal range.