Schema Agreement in Cassandra
Written by Wendy Garraty
I am looking for advice on how to improve the speed of schema agreement. Schema changes must be propagated to all nodes in the cluster; once they have all converged on a common version, we say they are in agreement. Alternatively, you can disable schema metadata refreshes altogether (see the sketch below).

When a node is bootstrapping, we use a set of latches (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of in-flight schema pull requests, and we do not proceed with bootstrap/streaming until all the latches are released (or we time out waiting on each of them). One problem with this is that if we have a large schema, or if pulling the schema from the other nodes is unexpectedly slow, there is no explicit check that we actually received a schema before proceeding.

This may not be a direct answer to your question, but it may explain why things are slower with the DataStax driver than with cqlsh. There was an inefficiency in the driver in how schema refreshes were handled when a client made a schema change, although this (JAVA-1120) should not delay the return of each schema-altering statement by more than about 1 second.

A schema agreement failure is not fatal, but it can lead to unexpected results (as explained at the beginning of this section).

We then tried using cqlsh to connect to each node in the cluster, but we kept hitting the same problem: on each node, Cassandra knew about the table and we could see its schema definition, but when we tried to query it, we got the following error:

To avoid this problem, the driver waits for all nodes to agree on a common schema version. If you are running a cluster with mixed major/minor server versions (for example, Cassandra 2.1 and 2.2), schema agreement will never succeed: the way the schema version is computed changed between those versions, so the nodes report different versions even when they actually match (see JAVA-750 for the technical details).
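Picking up the point above about turning schema metadata off: a minimal sketch, assuming the 4.x Java driver and a node reachable on the driver's default contact point (the keyspace and table names are placeholders). Disabling the metadata around a burst of DDL and re-enabling it afterwards triggers a single immediate refresh:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class SchemaMetadataToggle {
  public static void main(String[] args) {
    // Assumes a node reachable on localhost:9042, the driver's default.
    try (CqlSession session = CqlSession.builder().build()) {
      // Suspend client-side schema metadata refreshes while applying DDL.
      session.setSchemaMetadataEnabled(false);

      session.execute(
          "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
              + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
      session.execute(
          "CREATE TABLE IF NOT EXISTS demo.users (id uuid PRIMARY KEY, name text)");

      // Re-enabling triggers an immediate refresh of the schema metadata.
      session.setSchemaMetadataEnabled(true);

      // Passing null reverts to whatever the configuration specifies.
      session.setSchemaMetadataEnabled(null);
    }
  }
}
```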
I often get the "Got schema agreement after …" output printed on the console, and the retries can go up to 300. That is what does not make sense to me: how can a schema change take so long to apply on a system that is essentially idle? To get a little more information about the wait time, I added some code to execute a statement and then wait for schema agreement (sketched below).

Connect to one of the nodes in the second schema list. For this example, we select the node "10.111.22.102".

Note that onKeyspaceAdded/onKeyspaceDropped is called on your schema listeners for the newly included/excluded keyspaces when you edit the keyspace filter list at runtime. Users who want to be notified of schema changes can implement the SchemaChangeListener interface.

Once you have completed the above steps on each node, all nodes should be on a single schema; in our case, we had exactly three nodes on each schema. In that situation, it is more likely that the nodes in the first schema list are the ones Cassandra settled on during schema negotiation, so try the following instructions on one of the nodes in the second schema list.
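Here is a minimal sketch of that instrumentation, written against the 4.x driver API (the original setup used driver 2.1.9, where the same timing can be taken around Session#execute; ExecutionInfo#isSchemaInAgreement, mentioned later in this section, reports whether the wait succeeded). The DDL string is a placeholder:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;

public class SchemaAgreementTiming {
  public static void main(String[] args) {
    try (CqlSession session = CqlSession.builder().build()) {
      // Placeholder DDL; assumes the demo keyspace already exists.
      String ddl = "CREATE TABLE IF NOT EXISTS demo.events (id uuid PRIMARY KEY)";

      long start = System.nanoTime();
      // execute() does not return until the driver's schema agreement
      // wait has completed (or timed out).
      ResultSet rs = session.execute(ddl);
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;

      boolean agreed = rs.getExecutionInfo().isSchemaInAgreement();
      System.out.printf("Schema agreement: %b after %d ms%n", agreed, elapsedMs);
    }
  }
}
```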
Although it is possible to increase migration_task_wait_in_seconds to force the node to wait longer on each latch, there are cases where this does not help, because the callbacks for the schema pull requests are expired from the messaging service callback map (org.apache.cassandra.net.MessagingService#callbacks) after request_timeout_in_ms (10 seconds by default), before the other nodes have been able to respond to the new node.

There was our problem! We had a schema disagreement! Three of the nodes in our six-node cluster were on a different schema:

Every time schema metadata is disabled and re-enabled (via configuration or the API), a refresh is triggered immediately. If you need to track schema changes, you do not need to poll the metadata manually; instead, you can register a listener to be notified when changes occur (see the sketch below).

If that does not work, it means that the other schema is the one Cassandra has settled on as authoritative, so repeat these steps for the nodes in the first schema list.

Metadata#getKeyspaces returns a client-side representation of the database schema. We found a Stack Overflow post suggesting that one solution to the schema disagreement problem was to restart the nodes one at a time. We tried that and it worked; below are the steps that worked for us.

Schema metadata is fully immutable (both the map and all the objects it contains). It represents a snapshot of the database at the time of the last metadata refresh, and it is consistent with the token map of its parent Metadata object.
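For example, a minimal registration sketch, assuming the 4.x driver (where the session builder accepts a listener, SchemaChangeListenerBase provides no-op defaults, and the callback is named onKeyspaceCreated rather than the onKeyspaceAdded mentioned above):

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.metadata.schema.KeyspaceMetadata;
import com.datastax.oss.driver.api.core.metadata.schema.SchemaChangeListenerBase;

public class ListenerRegistration {
  public static void main(String[] args) {
    try (CqlSession session =
        CqlSession.builder()
            .withSchemaChangeListener(
                new SchemaChangeListenerBase() {
                  @Override
                  public void onKeyspaceCreated(KeyspaceMetadata keyspace) {
                    // Fired whenever the driver detects a new keyspace,
                    // regardless of which client created it.
                    System.out.println("Keyspace created: " + keyspace.getName());
                  }
                })
            .build()) {
      session.execute(
          "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
              + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
    }
  }
}
```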
Remember that the Metadata object itself is immutable. If you need the most recent schema, be sure to call session.getMetadata().getKeyspaces() again, rather than calling getKeyspaces() on a stale Metadata reference (see the sketch below).

If everything went well, you should see that the node "10.111.22.102" has moved to the other schema list (note: the list of nodes is not sorted by IP):

We checked DataStax, which had the article Managing schema disagreements; however, their official documentation was sparse and assumed that the affected node was unreachable.

Note that it is best not to register a listener until the cluster is fully initialized; otherwise the listener could be notified with a large number of "added" events while the driver first builds the schema metadata from scratch.

As explained above, the driver waits for schema agreement after executing a schema-altering query. This ensures that subsequent requests (which may be routed to different nodes) see an up-to-date version of the schema. The schema agreement wait runs synchronously, so the execute() call (or the ResultSetFuture completion, if you are using the asynchronous API) does not return until the wait is complete.
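A small illustration of the stale-reference pitfall (a sketch; the keyspace name is a placeholder, and the second lookup assumes the post-DDL metadata refresh has already completed):

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.metadata.Metadata;

public class FreshMetadata {
  public static void main(String[] args) {
    try (CqlSession session = CqlSession.builder().build()) {
      Metadata before = session.getMetadata(); // immutable snapshot, taken now

      session.execute(
          "CREATE KEYSPACE IF NOT EXISTS demo2 WITH replication = "
              + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

      // Wrong: 'before' is a snapshot and will never contain demo2.
      System.out.println("stale: " + before.getKeyspace("demo2").isPresent());

      // Right: go through the session again for a fresh snapshot.
      System.out.println("fresh: " + session.getMetadata().getKeyspace("demo2").isPresent());
    }
  }
}
```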
The check is implemented by repeatedly querying the system tables for the schema version reported by each node, until they all converge on the same value. If that does not happen within a certain time, the driver gives up waiting. The default timeout is 10 seconds; it can be customized when you create your cluster (see the sketch below).

This problem would be difficult to solve reliably, and it should not be much of a problem in practice anyway: if you are in the middle of a rolling upgrade, you are probably not applying schema changes at the same time.

This is done by querying the system tables for the schema version of every node that is currently up. If all the versions match, the check succeeds; otherwise it is retried periodically, until a given timeout. This process is tunable in the driver configuration.

Now that Cassandra is back up, run the describe cluster command again to see whether the node has moved to the other schema:

If there are more nodes on one schema than on the other, you can first try restarting a Cassandra node from the smaller list and checking whether it joins the other schema list. After you execute a statement, you can use ExecutionInfo#isSchemaInAgreement to find out whether the schema agreement check succeeded or timed out:

github.com/apache/cassandra/commit/08450080614250a8bfaba23dbca741a4d9315e3c

Because of the distributed nature of Cassandra, schema changes made on one node may not be immediately visible to the others. Left unaddressed, this could lead to inconsistent results when successive requests are routed to different coordinators.

I have a Cassandra 2.x cluster with 3 nodes and an out-of-the-box configuration. My Java program uses the DataStax driver (2.1.9) to create the schema by executing CQL/DDL statements one by one. What I observe is that, for several of the statements, reaching schema agreement often takes tens of seconds; which statement takes a long time is fairly random.

Once registered, the listener is notified of any schema change detected by the driver, regardless of where it originated.
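With the 3.x driver, the timeout is set when building the Cluster (Cluster.Builder#withMaxSchemaAgreementWaitSeconds). A sketch of the 4.x programmatic equivalent, where both the polling interval and the timeout are configuration options (the 20-second value is an arbitrary example):

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import java.time.Duration;

public class AgreementTimeoutConfig {
  public static void main(String[] args) {
    DriverConfigLoader loader =
        DriverConfigLoader.programmaticBuilder()
            // How often the driver polls the system tables for schema versions.
            .withDuration(
                DefaultDriverOption.CONTROL_CONNECTION_AGREEMENT_INTERVAL,
                Duration.ofMillis(200))
            // How long it keeps polling before giving up.
            .withDuration(
                DefaultDriverOption.CONTROL_CONNECTION_AGREEMENT_TIMEOUT,
                Duration.ofSeconds(20))
            .build();

    try (CqlSession session = CqlSession.builder().withConfigLoader(loader).build()) {
      // DDL executed here uses the tuned agreement settings.
    }
  }
}
```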
This patch validates schema agreement between the bootstrapping node and the rest of the live nodes before proceeding with bootstrap. It also adds a check to prevent a new node from flooding existing nodes with simultaneous schema pull requests, as can happen in large clusters.

Note: if I create the schema with cqlsh instead, it is quite fast.

You can also perform an on-demand agreement check at any time using Session#checkSchemaAgreementAsync (or its synchronous counterpart); see the sketch below.

For estimateNumberOfSplits, you need a way to estimate the total number of partition keys (this is what scan clients would traditionally obtain with the Thrift describe_splits_ex operation). Starting with Cassandra 2.1.5, this information is available in a system table (see CASSANDRA-7688).

Unit tests for this feature are included as part of the PR.

Let's review the log to make sure Cassandra has restarted successfully. Removing the latch mechanism should also prevent new nodes in large clusters from getting stuck for long periods waiting migration_task_wait_in_seconds on each of the latches left orphaned by the timed-out callbacks.

If you are able to, upgrading to java-driver 3.0.3+ will solve this problem for you.

CASSANDRA-16732: a node cannot be replaced if the cluster is in a mixed major version state.

See SchemaChangeListener for the list of available methods.
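A minimal sketch of the on-demand check, assuming the 4.x API:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class OnDemandAgreementCheck {
  public static void main(String[] args) {
    try (CqlSession session = CqlSession.builder().build()) {
      // Synchronous variant: blocks until one round of checking completes.
      boolean agreed = session.checkSchemaAgreement();
      System.out.println("Schema in agreement: " + agreed);

      // Asynchronous variant; joined here only to keep the example linear.
      boolean asyncAgreed =
          session.checkSchemaAgreementAsync().toCompletableFuture().join();
      System.out.println("Async check: " + asyncAgreed);
    }
  }
}
```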
SchemaChangeListenerBase is a convenient base implementation with empty methods, for cases where you only need to override a few of them; a sketch follows.
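For example, a sketch assuming the 4.x class and callback names (onTableCreated/onTableDropped); every method that is not overridden keeps the empty inherited implementation. The class could be registered at build time just like the anonymous listener shown earlier:

```java
import com.datastax.oss.driver.api.core.metadata.schema.SchemaChangeListenerBase;
import com.datastax.oss.driver.api.core.metadata.schema.TableMetadata;

// Logs table creations and drops; all other schema events are ignored.
public class TableLoggingListener extends SchemaChangeListenerBase {

  @Override
  public void onTableCreated(TableMetadata table) {
    System.out.println("Table created: " + table.getName());
  }

  @Override
  public void onTableDropped(TableMetadata table) {
    System.out.println("Table dropped: " + table.getName());
  }
}
```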