Wednesday 7 September 2016

Setting up Spark on a multi-machine cluster

  1. Spark works on a master-slave model. 
  2. You need a master machine and one or more slave machines. The master and a slave can run on the same machine too. 
  3. For testing purposes you can bring up the whole cluster on a single machine; the approach is the same. 
  4. On each machine, check whether Java is installed.

    JAVA install/update 
  5. If you are using a new machine, e.g. a fresh EC2 instance, it won't come preinstalled with Java.  
  6. You can set up Java by following the steps below.
  7.  First update the package index:
    1. sudo apt-get update
    2. Check the version: java -version 
    3. If Java is not found → install the JRE and JDK:
      1. sudo apt-get install default-jre
      2. sudo apt-get install default-jdk
  8. Check the version again: java -version 
  9. This might install Java 1.6. 
  10. sudo apt-get install openjdk-7-jre
  11. sudo apt-get install openjdk-7-jdk
  12. Issue: java may still point to version 6. 
  13. Follow the steps below to switch the default (a consolidated sketch follows this list):
    1. update-java-alternatives -l
    2. sudo update-java-alternatives -s java-1.7.0-openjdk-amd64 (use the name of the newer version shown by the previous command)
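
    A minimal end-to-end sketch of the Java steps above, assuming an Ubuntu machine and that update-java-alternatives -l lists java-1.7.0-openjdk-amd64 (the exact name may differ on your machine):

      sudo apt-get update                                          # refresh the package index
      java -version                                                # check the current version
      sudo apt-get install openjdk-7-jre openjdk-7-jdk             # install Java 7
      update-java-alternatives -l                                  # list the installed Java versions
      sudo update-java-alternatives -s java-1.7.0-openjdk-amd64    # switch the default to Java 7
      java -version                                                # should now report 1.7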

    Instance Setup 
  14. Log in to your dev box or EC2 instances. 
  15. If running on EC2, do the following:
    1. Create an AWS account, generate a key pair, and download the private key to one of your machines. Steps for generating key pairs are available online. 
    2. Set two environment variables for your AWS credentials: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
    3. Go to the Spark directory on your local machine and run the spark-ec2 script from the ec2 directory.
    4. ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 --zone=us-west-1a launch my-spark-cluster
    5. By default this launches a two-instance cluster (one master and one slave) of large instances; a fuller launch sketch follows this list. 
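
    A sketch of the EC2 launch, assuming a key pair named awskey and showing the credentials plus the optional --slaves and --instance-type flags (the instance type here is only an example):

      export AWS_ACCESS_KEY_ID=...        # your AWS access key
      export AWS_SECRET_ACCESS_KEY=...    # your AWS secret key
      cd <your spark directory>/ec2       # the ec2 directory inside the Spark download
      ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem \
          --region=us-west-1 --zone=us-west-1a \
          --slaves=1 --instance-type=m3.large \
          launch my-spark-cluster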

    Set Up Spark 
  16. Download Spark → select the latest Hadoop version and follow the download link.  
  17. On a dev box you can download it from the command line (see the sketch after this list): 
    1. wget <spark download link> 
    2. Extract it: tar -xzf <spark tarball>
  18. Do this on every machine you want to run Spark on.
  19. Set the environment variable SPARK_HOME to point to the folder just above the bin directory; in the normal case this is the unzipped Spark folder.
  20. Add it to ~/.bashrc and apply it with: source ~/.bashrc
  21. Decide which machine will be the master and which will be the slaves.
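
    A sketch of the download, extraction and SPARK_HOME setup; the URL and version below are placeholders, use the link from the Spark download page:

      wget http://<mirror>/spark-1.6.2-bin-hadoop2.6.tgz          # placeholder URL and version
      tar -xzf spark-1.6.2-bin-hadoop2.6.tgz
      echo 'export SPARK_HOME=$HOME/spark-1.6.2-bin-hadoop2.6' >> ~/.bashrc
      echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
      source ~/.bashrc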


    Master Machine 
  22. Log in to the master machine and follow the steps below.
  23. Edit the hosts file on the master:
    1. vi /etc/hosts
    2. Map each IP address to its machine name, one entry per line (an example follows this list): 
    3. IP-address machine1
       IP-address machine2 
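
    For example, /etc/hosts might end up with entries like these (the IPs and names are made up, use your own machines):

      192.168.0.101   spark-master
      192.168.0.102   slave1
      192.168.0.103   slave2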
       
  24. The next few steps are changes inside the Spark distribution:
    1. The slaves template is located in the conf directory.
    2. There is a file called slaves.template; copy its contents into a new file called slaves in the same directory.
    3. List the names of all the slaves in this file, one per line (a sketch follows this list): 
      1. slave 1
      2. slave 2
      3. slave 3
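
    A sketch of creating the slaves file from the template (the hostnames are examples):

      cd $SPARK_HOME/conf
      cp slaves.template slaves
      # edit slaves so it lists one slave hostname per line, for example:
      #   slave1
      #   slave2
      #   slave3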
  25. Go to the Spark sbin directory and start the master: 
    1. cd $SPARK_HOME/sbin
    2. ./start-master.sh
  26. This starts the master; its web UI runs on port 8080 and can be accessed via the master's IP address or machine name.
  27. For the current case, go to any web browser and type: http://machinename:8080
  28. This is required because you need the Spark master address. The page shows the master URL at the top, e.g. spark://machinename:7077 
  29. Copy it. 
  30. Go to the conf directory and create spark-defaults.conf from spark-defaults.conf.template.
  31. Add the master's URL to the file (a sketch follows this step): 
  32. spark.master spark://machinename:7077 (the host part can also be an IP address or a fully qualified name)
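
    A sketch of creating spark-defaults.conf from its template and adding the master URL copied from the web UI:

      cd $SPARK_HOME/conf
      cp spark-defaults.conf.template spark-defaults.conf
      # append the master URL you copied from http://machinename:8080
      echo "spark.master spark://machinename:7077" >> spark-defaults.conf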
  33. Go to conf/spark-env.sh (create it from conf/spark-env.sh.template if it does not exist yet).
  34. Add the following lines to it (a sketch follows this list): 
  35. export SPARK_WORKER_MEMORY=1g → how much memory you want to give each worker 
    export SPARK_WORKER_INSTANCES=2 → number of worker instances per machine
    export SPARK_MASTER_IP=spark-2 → name of the Spark master
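
    A sketch of creating spark-env.sh from its template and appending the settings above (spark-2 stands for the master's hostname):

      cd $SPARK_HOME/conf
      cp spark-env.sh.template spark-env.sh
      echo "export SPARK_WORKER_MEMORY=1g" >> spark-env.sh
      echo "export SPARK_WORKER_INSTANCES=2" >> spark-env.sh
      echo "export SPARK_MASTER_IP=spark-2" >> spark-env.sh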

    Slave Machine 
  36. Go to each slave machine and run /sbin/start-slave.sh, passing it the master URL (a sketch follows this list).
  37. Run jps on both the slaves and the master to verify that the Worker and Master processes are up.
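
    A sketch of starting a worker on a slave and verifying the daemons, using the master URL copied earlier:

      # on each slave machine
      $SPARK_HOME/sbin/start-slave.sh spark://machinename:7077
      # on the master and on each slave, verify the running JVM processes
      jps    # the master should list a Master process, each slave a Worker process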

Note : You need passwordless SSH access from the master to the slaves. Copy the master's public key (~/.ssh/id_rsa.pub) into ~/.ssh/authorized_keys on each slave and save. A sketch follows. 
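
A sketch of setting up passwordless SSH from the master to a slave (the ubuntu user and the slave1 hostname are assumptions, adjust to your machines):

  # on the master: generate a key pair if you do not already have one
  ssh-keygen -t rsa
  # copy the public key into the slave's ~/.ssh/authorized_keys
  ssh-copy-id ubuntu@slave1
  # or manually: cat ~/.ssh/id_rsa.pub on the master and paste it into ~/.ssh/authorized_keys on the slave
  # test it
  ssh ubuntu@slave1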

Very Useful links
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
https://www.codementor.io/spark/tutorial/spark-python-rdd-basics