Hey Everybody,
Could someone please help me understand what is going on? I've just installed the DataStax Amazon AMI (DataStax Auto-Clustering AMI (ami-814ec2e8) - 2.4) with the following options: --clustername thoth-cluster --totalnodes 6 --version community --opscenter yes. I'll try to give as much detail as possible; please tell me if I missed something. I've tried it three times, always with the same result:
1) All machines belong to the same security group (Thoth-Cassandra), and the firewall rules are as follows. I've removed the group ID (but not the name) to make it clearer:
ICMP:
ALL 0.0.0.0/0
TCP:
1024 - 65535 (Thoth-Cassandra)
7000 (Thoth-Cassandra)
7199 (Thoth-Cassandra)
9160 (Thoth-Cassandra)
61620 (Thoth-Cassandra)
61621 (Thoth-Cassandra)
22 (SSH) 0.0.0.0/0
8888 0.0.0.0/0
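In case it helps to double-check the rules, here is a minimal Python sketch that models the TCP rules above as port ranges and reports which sources may reach a given port. The rule list is copied from above; the function name sources_allowed is just something I made up for illustration.

```python
# TCP rules from the security group above, as (first_port, last_port, source).
TCP_RULES = [
    (1024, 65535, "Thoth-Cassandra"),
    (7000, 7000, "Thoth-Cassandra"),
    (7199, 7199, "Thoth-Cassandra"),
    (9160, 9160, "Thoth-Cassandra"),
    (61620, 61620, "Thoth-Cassandra"),
    (61621, 61621, "Thoth-Cassandra"),
    (22, 22, "0.0.0.0/0"),
    (8888, 8888, "0.0.0.0/0"),
]


def sources_allowed(port, rules=TCP_RULES):
    """Return the sorted list of sources allowed to reach the given TCP port."""
    return sorted({src for first, last, src in rules if first <= port <= last})


# Print the allowed sources for every port mentioned in the rules.
for port in (22, 7000, 7199, 9160, 8888, 61620, 61621):
    print(port, sources_allowed(port))
```

One thing this makes visible: the 1024 - 65535 rule already covers 7000, 7199, 9160, 61620 and 61621 within the group, so the single-port rules are redundant but harmless.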
3) After carefully following the installation doc, I SSHed into node #1 and got this:
Waiting for nodetool...
The cluster is now in it's finalization phase. This should only take a moment...
Note: You can also use CTRL+C to view the logs if desired:
AMI log: ~/datastax_ami/ami.log
Cassandra log: /var/log/cassandra/system.log
No matter how long I wait, it does not go away. After a CTRL+C I got this Python stack trace:
^CTraceback (most recent call last):
  File "datastax_ami/ds4_motd.py", line 196, in <module>
    run()
  File "datastax_ami/ds4_motd.py", line 187, in run
    waiting_for_nodetool()
  File "datastax_ami/ds4_motd.py", line 84, in waiting_for_nodetool
    retcode = subprocess.call(shlex.split(config_data['nodetool_statement']), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "/usr/lib/python2.7/subprocess.py", line 493, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib/python2.7/subprocess.py", line 1291, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.7/subprocess.py", line 478, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
I did the same on two of the five remaining nodes and got the same result. I've tried waiting as long as 10 minutes.
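The trace shows the script blocked inside subprocess.call, waiting for the nodetool child process to exit, with no timeout. I have not read the rest of ds4_motd.py, so the following is not how the AMI does it; it is only a minimal Python 3 sketch of how such a wait could be bounded. The function name wait_for_command and all of its parameters are my own invention, and the demo uses the Python interpreter as a stand-in command so it can run anywhere.

```python
import subprocess
import sys
import time


def wait_for_command(cmd, total_timeout=600.0, interval=5.0, call_timeout=30.0):
    """Re-run cmd until it exits with status 0 or total_timeout seconds pass.

    Returns True on success, False if the deadline is reached first.
    """
    deadline = time.monotonic() + total_timeout
    while True:
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,  # discard output instead of piping it
                stderr=subprocess.DEVNULL,
                timeout=call_timeout,       # a hung child is killed, not waited on forever
            )
            if result.returncode == 0:
                return True
        except (subprocess.TimeoutExpired, OSError):
            pass  # treat a hang or a missing binary as "not ready yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in command for the demo; on a node this might be ["nodetool", "ring"].
print(wait_for_command([sys.executable, "-c", "pass"], total_timeout=5, interval=0.1))
```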
4) OpsCenter was up and running on the first node, but it could not find the agents on the remaining nodes. From OpsCenter I installed the missing agents after entering the Amazon internal IPs for the cluster. After that everything seems to work fine, but the cluster is not balanced at all.
5) When I ran "nodetool status" I got the following:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.114.121.165  110.69 KB  256     33.9%             0d27a009-011a-40a4-90c1-fdb908f45192  rack1
UN  10.32.63.148    100.96 KB  256     30.0%             903ca0b2-738a-4c1a-8c5f-7cb1931c97af  rack1
UN  10.195.218.16   89.48 KB   256     31.6%             8c08a193-9b20-4d61-94bd-6e52c710868e  rack1
UN  10.242.82.224   96.34 KB   256     34.9%             ec77b08e-171d-49ae-8e2c-1372fe642fab  rack1
UN  10.85.118.55    100.98 KB  256     34.3%             e414e095-e6e1-4305-801a-2895d688e93e  rack1
UN  10.118.137.162  98.35 KB   256     35.2%             f1a67b95-95c3-4181-826d-e357849c7258  rack1
Everything seems fine, right?
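To put numbers on that, here is a small Python sketch that parses the six node rows above and sums the "Owns (effective)" column. The parsing assumes eight whitespace-separated fields per node row, which holds for the output I pasted but may not hold for other nodetool versions.

```python
# The six node rows from the "nodetool status" output above.
STATUS_ROWS = """\
UN 10.114.121.165 110.69 KB 256 33.9% 0d27a009-011a-40a4-90c1-fdb908f45192 rack1
UN 10.32.63.148 100.96 KB 256 30.0% 903ca0b2-738a-4c1a-8c5f-7cb1931c97af rack1
UN 10.195.218.16 89.48 KB 256 31.6% 8c08a193-9b20-4d61-94bd-6e52c710868e rack1
UN 10.242.82.224 96.34 KB 256 34.9% ec77b08e-171d-49ae-8e2c-1372fe642fab rack1
UN 10.85.118.55 100.98 KB 256 34.3% e414e095-e6e1-4305-801a-2895d688e93e rack1
UN 10.118.137.162 98.35 KB 256 35.2% f1a67b95-95c3-4181-826d-e357849c7258 rack1
"""


def parse_status(text):
    """Return a list of (address, tokens, owns_percent) tuples, one per node row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: state, address, load, unit, tokens, owns%, host id, rack.
        if len(fields) == 8 and fields[5].endswith("%"):
            rows.append((fields[1], int(fields[4]), float(fields[5].rstrip("%"))))
    return rows


rows = parse_status(STATUS_ROWS)
owns = [pct for _, _, pct in rows]
total = sum(owns)
print("nodes:", len(rows))
print("total owns: %.1f%%" % total)
print("min/max owns: %.1f%% / %.1f%%" % (min(owns), max(owns)))
```

The column adds up to 199.9%, with individual nodes between 30.0% and 35.2%. My assumption is that the total is close to 200% rather than 100% because the effective figure counts replicas, but I have not confirmed that.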
6) But the weird thing is:
ubuntu@ip-10-32-63-148:~$ nodetool ring|wc -l
1544
ubuntu@ip-10-32-63-148:~$ nodetool ring|awk '{print $1}'|sort|uniq -c
3
1 ==========
256 10.114.121.165
256 10.118.137.162
256 10.195.218.16
256 10.242.82.224
256 10.32.63.148
256 10.85.118.55
ubuntu@ip-10-32-63-148:~$ ps xau|grep -i cassandraDaemon|grep -v grep|wc -l
2
So, there are 2 cassandraDaemon processes running on each node, and "nodetool ring" gives me 256 entries per node. Does that seem right? Am I missing something?
7) ami.log doesn't show any errors. Here are some lines from it (the file is 522 lines long, so I've removed most of it):
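For anyone who wants to repeat the count without awk, here is a rough Python equivalent of the pipeline above. It counts lines whose first field parses as an IP address and skips everything else. The sample text is made up by me in the rough shape of "nodetool ring" output (the token values are invented), so treat it only as an illustration of the counting.

```python
from collections import Counter
import ipaddress


def tokens_per_address(ring_output):
    """Count how many lines of ring output start with each IP address."""
    counts = Counter()
    for line in ring_output.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank line
        try:
            ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # header, separator, or other non-node line
        counts[fields[0]] += 1
    return counts


# Made-up sample, NOT real output: two entries for one node, one for another.
SAMPLE = """\
Datacenter: datacenter1
==========
Address         Rack   Status State   Load       Owns    Token

10.32.63.148    rack1  Up     Normal  100.96 KB  30.0%   -100
10.32.63.148    rack1  Up     Normal  100.96 KB  30.0%   -50
10.85.118.55    rack1  Up     Normal  100.98 KB  34.3%   0
"""

counts = tokens_per_address(SAMPLE)
for address, count in sorted(counts.items()):
    print(address, count)
```

On my cluster the real count is 256 per address, which matches the Tokens column in "nodetool status"; six nodes at 256 each accounts for 1536 of the lines in the ring output.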
(...)
gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during this run
(...)
gpg: no ultimately trusted keys found
(...)
[INFO] Seed list: set([u'10.242.82.224'])
[INFO] OpsCenter: 10.242.82.224
[INFO] Options: {'username': None, 'cfsreplication': None, 'heapsize': None, 'reflector': None, 'clustername': 'thoth-cluster', 'analyticsnodes': 0, 'seed_indexes': [0, 6, 6], 'realtimenodes': 6, 'opscenter': 'yes', 'totalnodes': 6, 'searchnodes': 0, 'opscenterinterface': None, 'version': 'community', 'dev': None, 'release': None, 'password': None, 'email': None, 'raidonly': None, 'javaversion': None}
[INFO] cassandra.yaml configured.
[INFO] opscenterd.conf not configured since conf was unable to be located.
[INFO] opscenter/thothcluster.conf not configured since opscenter was unable to be located.
[INFO] cassandra-env.sh configured.
(...)
[INFO] Clear "invalid flag 0x0000 of partition table 4" by issuing a write, then running fdisk on each device...
[INFO] Confirming devices are not mounted:
(...)
8) The Cassandra logs also show no errors.
9) The OpsCenter logs show no errors either.
All log files are almost exactly the same on all nodes.
I would really appreciate it if someone could tell me whether I did something wrong and, if so, what.
Thanks again,
Domingos