Running a cluster (deprecated)

This document describes how a cluster can be set up between middleware containers.

Concepts

For inter-OX communication over the network, multiple middleware containers (core-mw) can form a cluster. This brings several advantages regarding distribution and caching of volatile data, load balancing, scalability, fail-safety and robustness. Additionally, it provides the infrastructure for upcoming features of OX App Suite. The clustering capabilities of the containers are mainly built on Hazelcast, an open source clustering and highly scalable data distribution platform for Java. The following article provides an overview of the current feature set and configuration options.

Requirements

Synchronized system clock times

It is crucial that all members of a cluster have their system clocks in sync with each other. The system clock in a container is the same as on the host machine, because it is controlled by that machine's kernel. As containers are managed by Kubernetes, please make sure the system clocks of all Kubernetes workers are in sync, e.g. by using an NTP service.
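A quick way to verify this on a worker node, assuming a systemd-based distribution where timedatectl is available (an assumption, not a requirement of the middleware):

# should report "System clock synchronized: yes" and an active NTP service
timedatectl status | grep -i -E "synchronized|ntp"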

HTTP routing

A middleware cluster is always part of a larger picture. Usually there is a front-level loadbalancer (e.g. Istio) as the central HTTPS entry point to the platform. This loadbalancer optionally performs HTTPS termination and forwards HTTP(S) requests to the middleware containers.

A central requirement is session stability: HTTP routing must happen such that all HTTP requests of a session end up on the same middleware container.

If you are using Istio as a loadbalancer and deployed OX App Suite with our stack chart, proper HTTP routing is already in place. The stack chart deploys multiple so-called Destination Rules. A dedicated destination rule for the HTTP API ensures that the first request sets a routing cookie (<RELEASE_NAME>-core-mw-http-api-route). This cookie is then used to route follow-up requests to the same container that processed the initial request.
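To illustrate what cookie-based session affinity looks like in Istio, the following is a minimal DestinationRule sketch; it is not the chart's actual manifest, and the name and host are placeholders (only the cookie name is taken from the text above):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: core-mw-http-api-route            # hypothetical name
spec:
  host: <RELEASE_NAME>-core-mw-http-api   # placeholder service host
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:
          name: <RELEASE_NAME>-core-mw-http-api-route   # routing cookie mentioned above
          ttl: 0s                                       # session cookie, set on the first request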

There are several reasons why we require session stability in exactly this way. It is needed for horizontal scale-out: while we support transparent resuming/migration of user sessions in the OX cluster without the need for users to re-authenticate, a session wandering around randomly consumes the fixed amount of resources of a running session on each middleware container in the cluster, whereas a session sticky to one middleware container consumes this fixed amount of resources only there. Furthermore, there are mechanisms in OX App Suite, like TokenLogin, which only work if all requests belonging to one sequence get routed to the same middleware container, even if they originate from different machines with different IPs.

Same packages

All middleware containers participating in the Hazelcast cluster need to have the same open-xchange-* packages enabled, so that all dynamically injected class definitions are available during (de-)serialization on all containers. For example, even if a container does not serve requests from the web client, it still needs the realtime packages for collaborative document editing or the packages for the distributed session storage to be enabled.

Configuration

All settings regarding the cluster setup are located in the configuration file hazelcast.properties. The following gives an overview of the most important settings; please refer to the inline documentation of the configuration file for more advanced options. A full list of Hazelcast-related properties is available here.

General

To restrict access to the cluster and to separate the cluster from others in the local network, a group name needs to be defined. The group name and password can be set in your chart's values.yaml:

core-mw:
  hzGroupName: "REPLACE_WITH_HAZELCAST_GROUP_NAME"
  hzGroupPassword: "REPLACE_WITH_HAZELCAST_GROUP_PASSWORD"

Only middleware containers having the same values are able to join and form a cluster.

Network

It's possible to define the network interface that is used for cluster communication via com.openexchange.hazelcast.network.interfaces. By default, the interface is restricted to the IP address of the middleware container.

To form a cluster of multiple middleware containers, different discovery mechanisms can be used. The discovery mechanism is specified via the property com.openexchange.hazelcast.network.join and defaults to kubernetes. This mode requires defining a Kubernetes service (com.openexchange.hazelcast.network.join.k8s.serviceName) that exposes all Hazelcast instances.

The core-mw Helm chart automatically deploys such a Kubernetes service. The service is called <RELEASE_NAME>-core-mw-hazelcast-headless, and all middleware containers labeled with roles.middleware.open-xchange.com/hazelcast-data-holding=true (the default) will be found by this service.
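Put together, a minimal hazelcast.properties sketch for the default Kubernetes-based discovery could look as follows (the service name mirrors the default described above; treat all concrete values as placeholders for your deployment):

# Use the Kubernetes API for member discovery (the default join mechanism)
com.openexchange.hazelcast.network.join=kubernetes
# Headless service that exposes all Hazelcast instances
com.openexchange.hazelcast.network.join.k8s.serviceName=<RELEASE_NAME>-core-mw-hazelcast-headless
# Optional: restrict cluster communication to the container's own IP address (the default behavior)
com.openexchange.hazelcast.network.interfaces=<CONTAINER_IP>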

Advanced Configuration

Lite Members

Lite members in a Hazelcast cluster are members that do not hold any data partitions, i.e. all read and write operations on distributed maps are delegated to non-lite ("full") members. Apart from not holding data partitions, lite members participate in the same way as other members: they can register listeners for distributed topics (e.g. cache invalidation events) or can be addressed for task execution (e.g. during realtime communication).

Similar to using a custom partitioning scheme, separating the containers of a large cluster into few "full" members and many "lite" members helps to minimize the impact of JVM activities of a single container (mainly the garbage collector) on the cluster communication as a whole. Additionally, when starting or stopping lite members, no repartitioning of the distributed cluster data needs to be performed, which significantly decreases the container's startup and shutdown times and reduces the necessary network communication to a minimum.

In medium-sized or larger clusters, it is sufficient to have roughly 10 to 20 percent of the containers configured as "full" members, while all others can be started as "lite" members. Additionally, please note that the configured backup count in the map configurations should always be smaller than the total number of "full" members; otherwise, there may be problems if one of those data-holding containers is shut down. The minimum number of "full" members is therefore implicitly bound to the sum of a map's backupCount and asyncBackupCount properties, plus 1 for the original data partition. For example, with backupCount=1 and asyncBackupCount=0, at least two "full" members are required.

The configured "full" members should preferably not be used to serve client requests (by not adding them as endpoints in the loadbalancer), to ensure they are always responsive. Also, shutdowns and startups of those "full" members should be reduced to a minimum to avoid repartitioning operations.

More general information regarding lite members is available here.

To configure a container as "lite" member, you can specify the role hazelcast-lite-member in your chart's values.yaml:

core-mw:
  scaling:
    nodes:
      http-api:
        replicas: 2
        roles:
          - http-api
          - hazelcast-lite-member

Please refer to the chart's values.yaml for a more detailed example.
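To illustrate the split described above, a hedged values.yaml sketch might keep a small pool of data-holding ("full") members and run the request-serving pool as lite members; the pool names and replica counts are placeholders, and the hazelcast-data-holding role is an assumption mirroring the label mentioned in the Network section:

core-mw:
  scaling:
    nodes:
      hazelcast-data:                  # hypothetical pool of "full" members, kept out of the loadbalancer
        replicas: 3
        roles:
          - hazelcast-data-holding     # assumption: role matching the label from the Network section
      http-api:
        replicas: 10
        roles:
          - http-api
          - hazelcast-lite-member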

Features

The following list gives an overview of different features that were implemented using the cluster capabilities.

Distributed Session Storage

Previously, when an Open-Xchange server was shut down for maintenance, all user sessions that were bound to that machine were lost, i.e. the users needed to log in again. With the distributed session storage, all sessions are backed by a distributed map in the cluster, so that they are no longer bound to a specific container in the cluster. When a container is shut down, the session data is still available in the cluster and can be accessed from the remaining containers. The load-balancing techniques of the webserver then seamlessly route the user session to another container, with no session expired errors. The distributed session storage comes with the package open-xchange-sessionstorage-hazelcast. It's recommended to enable this optional package in all clustered environments with multiple middleware containers. The package can be enabled in your chart's values.yaml:

core-mw:
  packages:
    status:
      open-xchange-sessionstorage-hazelcast: enabled

Notes:

  • While there's some kind of built-in session distribution among the containers in the cluster, this should not be seen as a replacement for session-stickiness between the loadbalancer and middleware containers, i.e. one should still configure the loadbalancer to use sticky sessions for performance reasons.
  • The distributed session storage is still an in-memory storage. While the session data is distributed and backed up on multiple containers in the cluster, shutting down multiple or all containers at the same time will lead to loss of the distributed data. To avoid such data loss when shutting down a container, please follow the guidelines in section "Updating a Cluster".

Depending on the cluster infrastructure, different backup-count configuration options might be set for the distributed session storage in the map configuration file sessions.properties in the hazelcast subdirectory:

com.openexchange.hazelcast.configuration.map.backupCount=1

The backupCount property configures the number of containers with synchronized backups. Synchronized backups block operations until the backups are successfully copied and acknowledgements are received. If 1 is set as the backup count, for example, then all entries of the map will be copied to another JVM for fail-safety. 0 means no backup. Any integer between 0 and 6 is allowed; the default is 1, and values greater than 6 have no effect.

com.openexchange.hazelcast.configuration.map.asyncBackupCount=0

The asyncBackupCount property configures the number of containers with asynchronous backups. Async backups do not block operations and do not require acknowledgements. 0 means no backup. Any integer between 0 and 6 is allowed; the default is 0, and values greater than 6 have no effect.

Since session data is continuously backed up by multiple containers in the cluster by default, the steps described in Session_Migration to explicitly trigger session migration to other containers are obsolete and no longer needed with the distributed session storage.

Normally, sessions in the distributed storage are not evicted automatically, but are only removed when they're also removed from the session handler, either due to a logout operation or when exceeding the long-term session lifetime as configured by com.openexchange.sessiond.sessionLongLifeTime in sessiond.properties. Under certain circumstances, e.g. when the session is no longer accessed by the client and the middleware container hosting the session in its long-life container is shut down, the remove operation from the distributed storage might not be triggered. Therefore, a maximum idle time for map entries can additionally be configured for the distributed sessions map via

com.openexchange.hazelcast.configuration.map.maxIdleSeconds=640000

To avoid unnecessary eviction, the value should be higher than the configured com.openexchange.sessiond.sessionLongLifeTime in sessiond.properties.
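As a hedged sketch, the two settings can be aligned as follows; the concrete numbers are only illustrative (a one-week long-term lifetime is assumed), so check sessiond.properties for the value and unit actually configured in your deployment:

# sessiond.properties: long-term session lifetime (illustrative value, one week in milliseconds assumed)
com.openexchange.sessiond.sessionLongLifeTime=604800000
# hazelcast/sessions.properties: only evict distributed entries after the long-term lifetime has passed,
# i.e. choose an idle time (in seconds) that covers at least the lifetime configured above
com.openexchange.hazelcast.configuration.map.maxIdleSeconds=640000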

Remote Cache Invalidation

For faster access, groupware data is held in different caches by the server. Formerly, the caches utilized the TCP Lateral Auxiliary Cache plug-in (LTCP) of the underlying JCS caches to broadcast updates and removals to caches on other middleware containers in the cluster. This could potentially lead to problems when remote invalidation was not working reliably due to network discovery problems. As an alternative, remote cache invalidation can also be performed using reliable publish/subscribe events built on Hazelcast topics. This can be configured in the cache.properties configuration file, where the 'eventInvalidation' property can either be set to 'false' for the legacy behavior or 'true' for the new mechanism:

com.openexchange.caching.jcs.eventInvalidation=true

All containers participating in the cluster should be configured equally.

Internally, if com.openexchange.caching.jcs.eventInvalidation is set to true, LTCP is disabled in the JCS caches. Instead, an internal mechanism based on distributed Hazelcast event topics is used to invalidate data throughout all containers in the cluster after local update and remove operations. Put operations aren't propagated (and haven't been with LTCP either), since all data put into caches can be loaded/evaluated locally on each container from the persistent storage layer.

Using Hazelcast-based cache invalidation also makes further configuration of the JCS auxiliaries in the cache.ccf configuration file obsolete. In that case, all jcs.auxiliary.LTCP.* configuration settings are effectively ignored. However, it's still required to mark caches that require cluster-wide invalidation via jcs.region.<cache_name>=LTCP, just as before. So basically, when using the new default setting com.openexchange.caching.jcs.eventInvalidation=true, it's recommended to just use the stock cache.ccf file, since no further LTCP configuration is required.
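For illustration, a region entry in cache.ccf that is marked for cluster-wide invalidation simply carries the LTCP marker; no jcs.auxiliary.LTCP.* tuning is needed when event invalidation is enabled (the cache name is a placeholder):

# cache.ccf: flag the region for cluster-wide invalidation; the actual invalidation
# events are delivered via Hazelcast topics when eventInvalidation=true
jcs.region.<cache_name>=LTCP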

Administration / Troubleshooting

Hazelcast Configuration

The underlying Hazelcast library can be configured using the file hazelcast.properties.

Important: The Hazelcast JMX MBean can be enabled or disabled with the property com.openexchange.hazelcast.jmx. The properties com.openexchange.hazelcast.mergeFirstRunDelay and com.openexchange.hazelcast.mergeRunDelay control the run intervals of the so-called Split Brain Handler of Hazelcast that initiates the cluster join process when a new container is started. More details can be found at http://www.hazelcast.com/docs/2.5/manual/single_html/#NetworkPartitioning.

The port ranges used by Hazelcast for incoming and outgoing connections can be controlled via the configuration parameters com.openexchange.hazelcast.networkConfig.port, com.openexchange.hazelcast.networkConfig.portAutoIncrement and com.openexchange.hazelcast.networkConfig.outboundPortDefinitions.
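For illustration, the JMX switch and the port-related settings might look like this in hazelcast.properties (5701 is the well-known Hazelcast default port; the other values are purely illustrative, not necessarily the shipped defaults):

# Expose the Hazelcast JMX MBean
com.openexchange.hazelcast.jmx=true
# Incoming port, auto-increment if the port is already in use, and outbound port range
com.openexchange.hazelcast.networkConfig.port=5701
com.openexchange.hazelcast.networkConfig.portAutoIncrement=true
com.openexchange.hazelcast.networkConfig.outboundPortDefinitions=33000-35000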

Commandline Tool

To print out statistics about the cluster and the distributed data, the showruntimestats commandline tool can be executed with the clusterstats ('c') argument. This provides an overview of the runtime cluster configuration of the container, other members in the cluster and the distributed data structures.

JMX

In the Open-Xchange server Java process, the MBean com.hazelcast can be used to monitor and manage different aspects of the underlying Hazelcast cluster. The com.hazelcast MBean provides detailed information about the cluster configuration and distributed data structures.

Hazelcast Errors

When experiencing Hazelcast-related errors in the logfiles, most likely different versions of the packages are installed, leading to different message formats that can't be understood by containers using another version. Examples of such errors are exceptions in Hazelcast components regarding (de-)serialization or other message processing. This may happen when performing a consecutive update of all containers in the cluster, where containers with a heterogeneous setup temporarily try to communicate with each other. If the errors don't disappear after all containers in the cluster have been updated to the same package versions, it might be necessary to shut down the cluster completely, so that all distributed data is cleared.

Cluster Discovery Errors

  • If the started middleware containers don't form a cluster, please double-check your configuration in hazelcast.properties
  • It's important to have the same cluster name defined in hazelcast.properties throughout all containers in the cluster

Disable Cluster Features

The Hazelcast-based clustering features can be disabled with the following property changes (a combined sketch follows after the list):

  • Disable cluster discovery by setting com.openexchange.hazelcast.network.join to empty in hazelcast.properties
  • Disable Hazelcast by setting com.openexchange.hazelcast.enabled to false in hazelcast.properties
  • Disable message based cache event invalidation by setting com.openexchange.caching.jcs.eventInvalidation to false in cache.properties
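Taken together, the corresponding sketch looks like this:

# hazelcast.properties
com.openexchange.hazelcast.enabled=false
com.openexchange.hazelcast.network.join=empty

# cache.properties
com.openexchange.caching.jcs.eventInvalidation=false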

Updating a Cluster

Running a cluster means built-in failover on the one hand, but it might require some attention when it comes to upgrading the services on all containers in the cluster. This chapter gives an overview of general concepts and hints for updating the cluster.

The Big Picture

Updating an OX App Suite cluster is possible in several ways. The involved steps always include

  • Update the software by updating the container images
  • Update the database schemas (so-called update tasks)

There are some precautions required, though.

Update Tasks Management

It is a feature of the middleware to automatically start update tasks on a database schema when a user whose context lives on that schema tries to log in. For installations beyond a certain size, if you just update the middleware without special handling of the update tasks, user logins will trigger an uncontrolled storm of update tasks on the databases, potentially leading to resource contention, unnecessarily long update task runtimes, excessive load on the database server, and maybe even service outages.

We describe the update strategy in more detail in the next section. Note that this is still a high-level outline of the actual procedure, which requires additional details with regard to Hazelcast, given further down below.

Rolling strategy

It is possible to execute the update tasks decoupled from the real update of the rest of the cluster, days or even weeks ahead of time, with the following approach:

  • Start a new middleware container with the updated version that is not part of your cluster, configured identically to the other middleware containers.
  • Execute all update tasks from the update container.

In the last step, users from affected schemas will be logged out and denied service while the update tasks are running on their database schema. This is typically a short unavailability (some minutes) for a small part (1000 to 7000 users, depending on the installation) of the user base.

This way you end up with the production cluster running on the old version of OX App Suite, with the database already being upgraded to the next version. This is explicitly a valid and supported configuration. This approach offers the advantage that update tasks can be executed in advance, instead of doing them while the whole system is in a full maintenance downtime. Since update tasks can take some time, this is a considerable advantage.

For the actual upgrade of the production cluster, the remaining step is to upgrade your deployment to the latest stack chart version. This includes an update of the core-mw chart with updated container images.

Hazelcast will ensure that sessions from containers which get restarted are taken over by other containers in the cluster, so ideally this step works without losing user sessions.

Rolling Upgrade in Kubernetes

This guide assumes that you are already familiar with Kubernetes and have a running full stack deployment, e.g. deployed using the App Suite stack chart.

The procedure consists of a pre-update, where one update container that does not participate in the cluster executes the database update tasks, and the real update, where all middleware containers of the cluster get updated to the new image version of the software.

Pre-Update

In this phase, you will have to create a new Helm release of the core-mw chart with slightly modified values compared to your full stack deployment. This will create a Kubernetes job that starts a middleware container and runs all the update tasks.

Preparations

  • Take backups of as much as possible (databases, OX configuration files, etc.).

Step-by-Step Guide

To create the Kubernetes job that runs the update tasks, follow these steps:

  • Download the desired version of the core-mw chart from registry.open-xchange.com and extract it to a new folder.
  • Create a new values file (e.g. upgrade-values.yaml) in the extracted directory.
  • Copy the core-mw configuration of your existing full stack deployment into the new file to ensure that the upgrade container has the same configuration in order to execute all necessary update tasks.
  • Adjust the values for the new version if necessary (please see the chart's release notes).
  • Modify values in upgrade-values.yaml:

    • Set enableInitialization: false to prevent initialization.
    • Remove the existing mysql section and add a global one that references your existing Kubernetes secret containing the access information:

      global:
        mysql:
          existingSecret: "<RELEASE>-core-mw-mysql"
      
    • Add globalDBID and set the global database ID of your deployment.

    • Enable the update section. This ensures that no deployment and only the upgrade job will be created. Optionally define the database schemata to reduce the load on the database:

      update:
        enabled: true
        schemata: "<YOUR_DATABASE_SCHEMATA>"
        image:
          tag: "<IMAGE_TAG>"
      
  • Install a new Helm release of the core-mw chart with your modified upgrade-values.yaml into the same namespace as your full stack deployment, so that the upgrade container can find your database and establish a connection. The Helm release will create a Kubernetes job and run the update tasks. Wait for the job to complete (a command sketch follows after this list).

  • If the job has succeeded, uninstall the Helm release, which will remove the Kubernetes job, and proceed with the real update. If the job failed, manually resolve the situation, then remove the Kubernetes job via kubectl and optionally re-run the update tasks via helm upgrade. This will create another job that executes the update tasks again.
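A hedged command sketch for this part, with release name, namespace, chart location and job name as placeholders (the actual job name is deployment-specific and can be looked up with kubectl get jobs):

# install the upgrade release next to the existing deployment
helm install <UPGRADE_RELEASE> ./core-mw -f upgrade-values.yaml -n <NAMESPACE>

# wait for the update-task job to finish
kubectl -n <NAMESPACE> wait --for=condition=complete job/<UPDATE_JOB_NAME> --timeout=2h

# on success, remove the upgrade release (and with it the job) again
helm uninstall <UPGRADE_RELEASE> -n <NAMESPACE>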

Real Update

The real update is simpler: update the core-mw chart version of your existing full stack deployment to the new version and run helm upgrade.
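For example (release name, chart reference and namespace are placeholders; the exact chart reference depends on how the stack was installed):

helm upgrade <RELEASE> <CHART_REFERENCE> --version <NEW_CHART_VERSION> -f values.yaml -n <NAMESPACE>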

Note: Be cautious with major chart updates, since they might require adjusting the chart values. Please see the chart's release notes for more information.

Reference Documentation

Limitations

While in most cases a seamless, rolling upgrade of all containers in the cluster is possible, there may be situations where containers running a newer version are not able to communicate with older containers in the cluster, i.e. they can't access distributed data or consume incompatible event notifications - especially when the underlying Hazelcast library is part of the update, since Hazelcast does not support this scenario at the moment. In such cases, the release notes will contain corresponding information, so please have a look there before applying an update.

Additionally, there may always be some kind of race condition during an update, i.e. client requests that can't be completed successfully or internal events not being delivered to all containers in the cluster. That's why the following information should only serve as a best-practices guide to minimize the impact of upgrades on the user experience.

Upgrades of the Hazelcast library

In case an upgrade includes a major update of the Hazelcast library, a newly upgraded container will usually not be able to connect to the containers running the previous version. In this case, volatile cluster data is lost after all containers in the cluster have been updated, including sessions held in the distributed session storage. As outlined above, the release notes will contain a corresponding warning in such cases. Starting with v7.10.3, separation of the clusters during rolling upgrades is enforced by appending a version suffix to the cluster group name.

Besides upgraded containers not being able to access distributed data of the legacy cluster, this also means that new data is not available in the legacy cluster, which may cause trouble if the updated backend version needs to perform database update tasks. Database update tasks usually operate in a "blocking" way, and all contexts associated with the schema being upgraded are disabled temporarily. Since context data itself is held in caches on potentially every container in the cluster, the affected cache entries are invalidated during the database update. And since cluster-wide cache invalidations again utilize Hazelcast functionality (section "Remote Cache Invalidation"), such invalidations normally won't be propagated to containers running a previous version of the Hazelcast library.

Other Considerations

  • Do not stop a container while update tasks are running.
  • In case of trouble, i.e. a container refuses to join the cluster again after a restart, consult the logfiles first for any hints about what is causing the problem - both on the disconnected container and on the other containers in the cluster.
  • If there are general incompatibilities between two revisions of the middleware that prevent operation in a cluster (see release notes), it's recommended to choose another cluster name for the containers with the new version. This will temporarily lead to two separate clusters during the rolling upgrade, with the old cluster finally being shut down completely after the last container has been updated to the new version. While distributed data can't be migrated from one server version to another in this scenario due to the incompatibilities, the uptime of the system itself is not affected, since the containers in the new cluster are able to serve new incoming requests directly.
  • When updating only UI plugins without also updating to a new version of the core UI, you also need to perform the additional step from Updating UI plugins.