This document describes basic Apache Cassandra configuration and integration with the Databricks platform. It does not cover setting up an AWS machine or configuring an AWS VPC network, so there are several prerequisites to complete before you begin working with Cassandra.
Ensure that all tasks specified below have been completed.
Open port 9042 (the Cassandra default port for CQL) in the EC2 security group. You can find instructions here.
Open port 9042 on the local firewall:
iptables -A INPUT -i eth0 -p tcp --dport 9042 -j ACCEPT
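Once the firewall rule is in place, reachability of the port can be sanity-checked from any machine with network access to the node. A minimal sketch (the host value is a placeholder, not from this document):

```python
# Sketch: check whether a TCP port (e.g. Cassandra's 9042) accepts connections.
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# Example usage (replace with your Elastic IP):
# port_open("xx.xx.xx.xx", 9042)
```

A `False` result usually means the security group, the local firewall, or the `cassandra.yaml` bind addresses still block remote access.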
Modify cassandra.yaml to set up remote access to the Cassandra cluster:
listen_address: localhost
rpc_address: 0.0.0.0
broadcast_rpc_address: xx.xx.xx.xx
Where xx.xx.xx.xx is the Elastic IP assigned to the EC2 machine. More information is available in this Stack Overflow article.
Restart Cassandra to apply the changes:
sudo service cassandra restart
Navigate to the Shared folder in your Databricks workspace and add a new library. Select PyPI as the library source and enter cassandra-driver as the package name.
Create the library. It will then be installed on your cluster automatically; you can verify the progress and status of the installation on the panel.
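After the cassandra-driver library is installed, remote CQL access can be verified from a notebook cell. A minimal sketch, assuming the driver is installed and the Elastic IP placeholder is reachable on port 9042 (the helper name `check_cassandra` is illustrative, not from this document):

```python
# Sketch: verify remote CQL access using the cassandra-driver package.
try:
    from cassandra.cluster import Cluster
except ImportError:
    Cluster = None  # cassandra-driver not installed yet

def check_cassandra(host, port=9042):
    """Return the server's release version, or None if the driver is missing."""
    if Cluster is None:
        return None
    cluster = Cluster([host], port=port)
    try:
        session = cluster.connect()
        row = session.execute("SELECT release_version FROM system.local").one()
        return row.release_version
    finally:
        cluster.shutdown()

# Example usage (replace with your Elastic IP):
# check_cassandra("xx.xx.xx.xx")
```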
Download the Cassandra Spark connector, file spark-cassandra-connector_2.11-2.3.1.jar (this exact version). You can download the jar from here.
Install the spark-cassandra-connector_2.11-2.3.1.jar library manually on Databricks.
The Python code should be run on a Databricks Spark cluster with the following specification:
Databricks Runtime Version: 4.3 (includes Apache Spark 2.3.1, Scala 2.11), Python version 2.
Note that xx.xx.xx.xx is the Elastic IP assigned to the EC2 machine.
d = spark.read.format("org.apache.spark.sql.cassandra").options(table="trans", keyspace="test").option("spark.cassandra.connection.host", "xx.xx.xx.xx").load()
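In spark-cassandra-connector 2.x the connection host is typically passed via the `spark.cassandra.connection.host` option key rather than a bare `host` key. The sketch below builds the option set as a plain dict, using the placeholders from this document, so it can be reused or inspected independently of a running Spark session:

```python
# Sketch: assemble the options for
# spark.read.format("org.apache.spark.sql.cassandra").
# Assumption: connector 2.x reads the host from the
# "spark.cassandra.connection.host" key.
def cassandra_read_options(host, keyspace, table):
    return {
        "spark.cassandra.connection.host": host,
        "keyspace": keyspace,
        "table": table,
    }

opts = cassandra_read_options("xx.xx.xx.xx", "test", "trans")
# d = spark.read.format("org.apache.spark.sql.cassandra").options(**opts).load()
```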