Schema Agreement Cassandra

I am looking for advice on how to improve the speed of schema agreement. Schema changes must be propagated to all nodes in the cluster; once every node reports the same schema version, the nodes are said to be in agreement. (Alternatively, you can disable schema metadata refreshes in the driver via configuration or API.) When a node is bootstrapping, Cassandra uses a series of latches (org.apache.cassandra.service.MigrationTask#inflightTasks) to track in-flight schema pull requests, and it does not continue with bootstrap/streaming until all latches are released (or it has waited on each of them). One problem with this is that if the schema is large, or if retrieving the schema from the other nodes is surprisingly slow, there is no explicit validation that we actually received a schema before continuing. This may not be a direct answer to your question, but it may help explain why things are slower with the DataStax driver than with cqlsh: there was an inefficiency in how the driver refreshed schema metadata when a client made a schema change, although this (JAVA-1120) should not delay the return of each schema-altering statement by more than about a second. A schema agreement failure is not fatal, but it can lead to unexpected results (as explained at the beginning of this section). We then tried using cqlsh to connect to each node in the cluster, but we always had the same problem: on each node, Cassandra knew about the table and we could see its schema definition, but when we tried to query it, we got an error. To avoid this problem, the driver waits for all nodes to agree on a common schema version.
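Agreement in this sense can be observed directly, because each node exposes the schema version it currently holds in its system tables, and this is essentially what the driver's check queries. Below is a minimal sketch, assuming the 4.x Java driver; the contact point, port, and data center name are placeholders to adapt to your cluster.

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.Row;
    import java.net.InetSocketAddress;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.UUID;

    public class SchemaVersionCheck {
      public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
            .addContactPoint(new InetSocketAddress("10.111.22.101", 9042)) // placeholder contact point
            .withLocalDatacenter("dc1")                                    // placeholder data center name
            .build()) {

          Set<UUID> versions = new HashSet<>();

          // Schema version reported by the node we are connected to.
          Row local = session.execute("SELECT schema_version FROM system.local").one();
          versions.add(local.getUuid("schema_version"));

          // Schema versions reported for the peers known to that node.
          for (Row peer : session.execute("SELECT peer, schema_version FROM system.peers")) {
            UUID v = peer.getUuid("schema_version");
            if (v != null) { // a peer that is down may not report a version
              versions.add(v);
            }
          }

          // A single distinct version means the cluster is in schema agreement.
          System.out.println(versions.size() == 1
              ? "Schema in agreement: " + versions
              : "Schema DISAGREEMENT, versions seen: " + versions);
        }
      }
    }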

If you are running a cluster with mixed major/minor server versions (for example, Cassandra 2.1 and 2.2), schema agreement will never succeed. This is because the way the schema version is calculated changed between those versions, so nodes report different versions even when their schemas actually match (see JAVA-750 for the technical details). I often get the "Got schema agreement after" output printed on the console, and the retries can go up to 300. That's what doesn't make sense to me: how can a schema change take so long on a system that is essentially idle? Connect to one of the nodes in the second schema list; for this example, we pick the node "10.111.22.102". I added the following code to execute a statement and then wait for schema agreement, to get a little more information about the wait time (a sketch of what that code might look like appears after this paragraph). Note that onKeyspaceAdded/onKeyspaceDropped is called on your schema listeners for newly included/excluded keyspaces when you edit the keyspace filter list at run time. Users who want to be notified of schema changes can implement the SchemaChangeListener interface. Once you have completed the above steps on each node, all nodes should be on a single schema version; in our case, we had exactly three nodes on each schema. In this case, it is more likely that the nodes in the first schema list are the ones Cassandra settled on during schema negotiation, so try the following steps on one of the nodes in the second schema list.
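The timing code itself is not shown in the post; here is a minimal sketch of what it might look like with the 4.x Java driver, wrapping each DDL statement and reporting how long the call took (which includes the driver's built-in wait for schema agreement) and whether agreement was reached. The keyspace and table names are placeholders, and contact points are assumed to come from the driver configuration.

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.ResultSet;

    public class TimedSchemaChange {

      // Executes one schema-altering statement and prints how long it took,
      // including the driver's built-in wait for schema agreement.
      static void executeTimed(CqlSession session, String ddl) {
        long start = System.nanoTime();
        ResultSet rs = session.execute(ddl);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%s -> %d ms, schema in agreement: %b%n",
            ddl, elapsedMs, rs.getExecutionInfo().isSchemaInAgreement());
      }

      public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) { // contact points from configuration
          executeTimed(session,
              "CREATE KEYSPACE IF NOT EXISTS demo_ks "            // placeholder keyspace
                  + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
          executeTimed(session,
              "CREATE TABLE IF NOT EXISTS demo_ks.t1 (id uuid PRIMARY KEY, v text)"); // placeholder table
        }
      }
    }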

Although it is possible to increase migration_task_wait_in_seconds to force the node to wait longer on each latch, there are cases where this does not help, because the callbacks for the schema pull requests can expire off the messaging service callback map (org.apache.cassandra.net.MessagingService#callbacks) after request_timeout_in_ms (10 seconds by default), before the other nodes have had a chance to respond to the new node. There was our problem! We had a schema disagreement: three nodes in our six-node cluster were on a different schema. Whenever schema metadata is disabled and then re-enabled (via configuration or API), a refresh is triggered immediately. If you need to track schema changes, you do not need to poll the metadata manually; instead, you can register a listener to be notified when changes occur (a registration sketch follows this paragraph). If that doesn't work, it means the other schema is the one Cassandra settled on as authoritative, so repeat these steps for the nodes in the first schema list. Metadata#getKeyspaces returns a client-side representation of the database schema. We found a StackOverflow post suggesting that one solution to the schema inconsistency problem is to restart the nodes one at a time. We tried that and it worked; the steps that worked for us are below. Schema metadata is completely immutable (both the map and all the objects it contains). It represents a snapshot of the database at the time of the last metadata refresh and is consistent with the token map of the parent Metadata object.
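As an illustration of the listener approach, here is a minimal sketch against the 3.x Java driver API (which is where the onKeyspaceAdded naming used above comes from); the registration call, contact point, and keyspace name are assumptions to adapt to your setup.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.KeyspaceMetadata;
    import com.datastax.driver.core.SchemaChangeListenerBase;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.TableMetadata;

    public class LoggingSchemaListener extends SchemaChangeListenerBase {

      // Only the callbacks we care about are overridden; the base class
      // provides empty implementations for everything else.
      @Override
      public void onKeyspaceAdded(KeyspaceMetadata keyspace) {
        System.out.println("Keyspace added: " + keyspace.getName());
      }

      @Override
      public void onTableAdded(TableMetadata table) {
        System.out.println("Table added: " + table.getKeyspace().getName() + "." + table.getName());
      }

      public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
            .addContactPoint("10.111.22.101") // placeholder contact point
            .build();
        try {
          Session session = cluster.connect();
          // Register after the cluster is initialized, to avoid a flood of
          // "added" events while the driver builds its initial schema metadata.
          cluster.register(new LoggingSchemaListener());
          session.execute("CREATE KEYSPACE IF NOT EXISTS demo_ks " // placeholder keyspace
              + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
        } finally {
          cluster.close();
        }
      }
    }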

Remember that the metadata itself is immutable. If you need the most recent schema, be sure to call session.getMetadata().getKeyspaces() again, rather than calling getKeyspaces() on a stale Metadata reference (a small example follows this paragraph). If everything went well, you should see that the node "10.111.22.102" has moved to the other schema list (note: the list of nodes is not sorted by IP). We checked DataStax, which has an article on managing schema disagreements; however, the official documentation was sparse and assumed that a node was unreachable. Note that it is best not to register a listener until the cluster is fully initialized, otherwise the listener could be notified with a large number of "added" events while the driver first builds its schema metadata from scratch. As explained above, the driver waits for schema agreement after executing a schema-altering query. This ensures that subsequent requests (which may be routed to different nodes) see an up-to-date version of the schema. The schema agreement check runs synchronously, so the execute() call (or the completion of the result future, if you are using the asynchronous API) does not return until the check is complete.
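To make the immutability point concrete, here is a minimal sketch, again assuming the 4.x Java driver and a placeholder keyspace name: a Metadata snapshot taken before a schema change never reflects it, while a fresh call to session.getMetadata() does.

    import com.datastax.oss.driver.api.core.CqlIdentifier;
    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.metadata.Metadata;

    public class FreshMetadataExample {
      public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) { // contact points from configuration
          // Snapshot taken BEFORE the schema change: this object will never change.
          Metadata before = session.getMetadata();

          session.execute("CREATE KEYSPACE IF NOT EXISTS demo_ks " // placeholder keyspace
              + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");

          CqlIdentifier ks = CqlIdentifier.fromCql("demo_ks");

          // Stale: the old snapshot does not reflect the newly created keyspace.
          System.out.println("old snapshot sees demo_ks? "
              + before.getKeyspaces().containsKey(ks));

          // Fresh: ask the session again to get the latest snapshot.
          System.out.println("fresh snapshot sees demo_ks? "
              + session.getMetadata().getKeyspaces().containsKey(ks));
        }
      }
    }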

Validation is implemented by repeatedly querying the system tables for the schema version reported by each node, until they all converge to the same value. If this does not happen within a certain amount of time, the driver gives up waiting. The default timeout is 10 seconds; it can be customized when you create your cluster (see the sketch after this paragraph). This problem would be difficult to solve reliably and should not be much of an issue in practice anyway: if you are in the middle of a rolling upgrade, you are probably not applying schema changes at the same time. The check works by querying the system tables for the schema version of every node that is currently up. If all versions match, the check succeeds; otherwise it is retried periodically until a given timeout. This process is tunable in the driver configuration. Now that Cassandra is back up, run the describe cluster command again to see whether the node has moved to the other schema. If there are more nodes on one schema than on the other, you can first try restarting a Cassandra node from the smaller list and checking whether it joins the other schema list. After you execute a statement, you can use ExecutionInfo#isSchemaInAgreement to check whether schema agreement succeeded or timed out: github.com/apache/cassandra/commit/08450080614250a8bfaba23dbca741a4d9315e3c. Because of the distributed nature of Cassandra, schema changes made on one node may not be immediately visible to the others. If this is not accounted for, it can lead to unexpected results when successive requests are routed to different coordinators. I have a Cassandra 2.x cluster with 3 nodes and an out-of-the-box configuration. My Java program uses the DataStax driver (2.19) to create the schema by executing CQL/DDL statements one by one. What I find is that schema changes often take tens of seconds to reach schema agreement, for several of the statements, and which statement takes a long time is fairly random. Once registered, the listener is notified of any schema change detected by the driver, regardless of where it originated.
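The wording "when you create your cluster" matches the older 2.x/3.x Java driver API; a hedged sketch of raising the timeout and inspecting the outcome with that API might look like the following. The contact point and DDL are placeholders, and 20 seconds is an arbitrary example value.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;

    public class SchemaAgreementTimeoutExample {
      public static void main(String[] args) {
        // Raise the schema agreement timeout from the default 10 seconds to 20.
        Cluster cluster = Cluster.builder()
            .addContactPoint("10.111.22.101") // placeholder contact point
            .withMaxSchemaAgreementWaitSeconds(20)
            .build();
        try {
          Session session = cluster.connect();
          ResultSet rs = session.execute(
              "CREATE KEYSPACE IF NOT EXISTS demo_ks " // placeholder keyspace
                  + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
          // false here means the driver gave up waiting before all nodes agreed.
          System.out.println("Schema in agreement: "
              + rs.getExecutionInfo().isSchemaInAgreement());
          // The check can also be re-run at any time from the cluster metadata.
          System.out.println("Re-check: " + cluster.getMetadata().checkSchemaAgreement());
        } finally {
          cluster.close(); // also closes any sessions created from this cluster
        }
      }
    }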

This hotfix verifies schema agreement between the bootstrapping node and the rest of the live nodes before bootstrap proceeds. It also adds a check to prevent the new node from flooding the existing nodes with concurrent schema pull requests, as can happen in large clusters. Note: if I create the schema with cqlsh, it is pretty fast. You can also perform an agreement check at any time using Session#checkSchemaAgreementAsync (or its synchronous counterpart); a short sketch follows this paragraph. For estimateNumberOfSplits, you need a way to estimate the total number of partition keys (this is what scan clients traditionally did with the Thrift describe_splits_ex operation); starting with Cassandra 2.1.5, this information is available in a system table (see CASSANDRA-7688). There are unit tests as part of the PR for this feature. Review the log to make sure Cassandra has restarted successfully. Removing the latch system should also prevent new nodes in large clusters from getting stuck for long periods waiting migration_task_wait_in_seconds on each of the latches left orphaned by timed-out callbacks. If you are able to, upgrading to java-driver 3.0.3+ will solve this problem for you. CASSANDRA-16732: a node cannot be replaced if the cluster is on mixed major versions. See SchemaChangeListener for the list of available methods.
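For the on-demand check, here is a minimal sketch assuming the 4.x Java driver, which is where Session#checkSchemaAgreementAsync and its synchronous counterpart live; contact points are assumed to come from the driver configuration.

    import com.datastax.oss.driver.api.core.CqlSession;

    public class OnDemandAgreementCheck {
      public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) { // contact points from configuration
          // Synchronous check: true if all currently up nodes report the same schema version.
          boolean agreed = session.checkSchemaAgreement();
          System.out.println("Schema in agreement (sync check): " + agreed);

          // Asynchronous variant: completes with the same boolean without blocking the caller.
          session.checkSchemaAgreementAsync()
              .thenAccept(inAgreement ->
                  System.out.println("Schema in agreement (async check): " + inAgreement))
              .toCompletableFuture()
              .join(); // joined here only so the demo doesn't exit before printing
        }
      }
    }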

SchemaChangeListenerBase is a convenient implementation with empty methods, for when you only need to override some of them.
