Pentaho Data Integration: WebSpoon on AWS Elastic Beanstalk and adding EBS or EFS Storage Volumes
This is a long overdue article on Hiromu’s WebSpoon. Hiromu has done fantastic work on WebSpoon - finally bringing the familiar Spoon desktop UI to the web browser.
For completeness’ sake I took the liberty of copying Hiromu’s instructions on how to set up the initial AWS Elastic Beanstalk environment. My main focus here is to provide simple approaches for adding persistent storage options to your WebSpoon setup, some of which are fairly manual (and should later be replaced by a dedicated automatic setup). The article is aimed mainly at users who are new to AWS.
Note: Once finished, always remember to terminate your AWS environment to stop incurring costs.
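The walkthrough below uses the AWS web console. If you happen to use the Elastic Beanstalk CLI instead, terminating is a one-liner (the environment name below is a placeholder):

```bash
# Tears down the EC2 instance(s) and associated resources of the environment
eb terminate webspoon-env
```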
This guide will give you an example of how to deploy webSpoon to the cloud with the following requirements.
To easily satisfy the requirements, this example uses the Docker image with the plugins and deploys it to AWS Elastic Beanstalk.
These are the rough steps using the new Amazon Configuration Interface:

In the Choose Environment dialog pick Web server environment.
In the Platform section, choose Multi-container Docker as the Preconfigured platform.

While a Dockerfile tells Docker how to provision our machine(s), we still need a method of orchestrating the setup of our cluster. This is where the Dockerrun.aws.json file comes in. In the Application code section, tick Upload your code and choose Dockerrun.aws.json as the Application code - contents copied below for convenience:

```json
{
  "AWSEBDockerrunVersion": 2,
  "containerDefinitions": [
    {
      "name": "webSpoon",
      "image": "hiromuhota/webspoon:latest-full",
      "essential": true,
      "memory": 1920,
      "environment": [
        {
          "name": "JAVA_OPTS",
          "value": "-Xms1024m -Xmx1920m"
        }
      ],
      "portMappings": [
        {
          "hostPort": 80,
          "containerPort": 8080
        }
      ]
    }
  ]
}
```

In the instance settings, change the instance type from t2.micro to t2.small or another instance type with 2 GB+ memory. Click Save.



Once the environment is up, access webSpoon at http://<your-beanstalk-app-url>/spoon/spoon (host port 80 is mapped to the container’s port 8080, so no port needs to be appended). Your Beanstalk app URL is shown on the AWS Beanstalk application overview page.
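To quickly verify that the application responds, a simple check from your local machine could look like this (the URL is the placeholder from above; this check is just a convenience, not part of the original setup):

```bash
# Expect an HTTP 200 (or a redirect) once the environment is healthy
curl -sI http://<your-beanstalk-app-url>/spoon/spoon | head -n 1
```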
The main aim of adding volumes is to persist the data outside of the Docker Container. We will have a look at various options:
Note: Choosing this option does not really provide much benefit: we only map Docker container folders to local folders on the EC2 instance. So if you were to terminate your Beanstalk environment, the files would be gone as well. The main benefit here is that if the Docker container gets terminated, the files at least survive on the EC2 instance.
Create a new project folder called beanstalk-with-ec2-instance-mapping.
Inside this folder create a Dockerrun.aws.json with the following content:
"AWSEBDockerrunVersion": 2, "volumes": [ "name": "kettle", "host": "sourcePath": "/var/app/current/kettle" , "name": "pentaho", "host": "sourcePath": "/var/app/current/pentaho" ], "containerDefinitions": [ "name": "webSpoon", "image": "hiromuhota/webspoon:latest-full", "essential": true, "memory": 1920, "environment": [ "name": "JAVA_OPTS", "value": "-Xms1024m -Xmx1920m" ], "portMappings": [ "hostPort": 80, "containerPort": 8080 ], "mountPoints": [ "sourceVolume": "kettle", "containerPath": "/root/.kettle", "readOnly": false , "sourceVolume": "pentaho", "containerPath": "/root/.pentaho", "readOnly": false , "sourceVolume": "awseb-logs-webSpoon", "containerPath": "/usr/local/tomcat/logs" ] ] First we create two volumes on the EC2 instance using the top level volumes JSON node: one for the .kettle files and one for the .pentaho files. Note that the sourcePath is the path on the host instance.
Note: This defines volumes on the hard drive of the EC2 instance you run your Docker container on. This is pretty much the same as if you were defining volumes on your laptop for a Docker container that you run. This does not magically set up any new EBS or EFS volumes.
Next, within the containerDefinitions for webSpoon, we add three mountPoints within the Docker container. Here we map the container paths to the volumes we created earlier on (kettle and pentaho). The third mount point we define is for writing out the logs: this is a default requirement of the Beanstalk setup. For each container, Beanstalk will automatically create a volume to store the logs. The volume name is made up of awseb-logs- plus the container name: in our case, this is awseb-logs-webSpoon. And the logs we want to store are the Tomcat server logs.
The Beanstalk environment setup procedure is exactly the same as before, so go ahead and set up the environment.
Note: On the EC2 instance, the directory /var/app/current/ is where the app files get stored (in our case this is only Dockerrun.aws.json). This folder does not require sudo privileges. If you ran the Docker container on your laptop, you might have noticed that by default Docker stores named volumes in /var/lib/docker/volumes. On the EC2 instance this directory requires sudo privileges.
Once the environment is running, we can ssh into the EC2 instance.
Note: You can find the public DNS of your EC2 instance via the EC2 panel of the AWS console.
It is beyond the scope of this article to explain how to set up the required key pairs to ssh into an EC2 instance: Here is a good article describing the required steps. If you want to ssh into your instance, read this first. You also have to make sure that your Beanstalk environment knows which key you want to use. You can configure this via the main Configuration Panel under Security. This will restart the EC2 instance.
```bash
ssh -i <path-to-pem-file> ec2-user@<ec2-instance-public-dns>
```

We can now double-check that the volume directories got created:

```
$ ls -l /var/app/current/
total 12
-rw-r--r-- 1 root root 1087 Dec 26 09:41 Dockerrun.aws.json
drwxr-xr-x 2 root root 4096 Dec 26 09:42 kettle
drwxr-xr-x 3 root root 4096 Dec 26 09:43 pentaho
```
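As an additional sanity check, you can inspect the mounts from the container’s perspective (a small sketch; grab the actual container ID from sudo docker ps first):

```bash
# Show which host paths are mounted into the webSpoon container
sudo docker inspect -f '{{ json .Mounts }}' <container-id>
```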
Note: An EBS Drive is a device that will be mounted directly to your EC2 instance. It cannot be shared with any other EC2 instance. In other words, every EC2 instance will have its own (possibly a set of) EBS Drive(s). This means that files cannot be shared across EC2 instances.
For the next steps to work the volume mapping from Docker container to the EC2 instance has to be in place (as discussed in the previous section). We cover this below again.
Basically we have to create two layers of volume mapping:
There is no way to define an EBS volume in the Dockerrun.aws.json file: You have to create another file with a .config extension, which has to reside in the .ebextensions folder within your project’s folder. So the project’s folder structure should be like this:
```
.
├── Dockerrun.aws.json
└── .ebextensions
    └── options.config
```

If you are not familiar with how mounting drives on Linux works, read this article first.
Important: Configuration files must conform to YAML formatting requirements. Always use spaces to indent and don’t use the same key twice in the same file.
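As a quick local sanity check before uploading (assuming Python 3 with PyYAML happens to be installed on your machine), you can try parsing the file:

```bash
# Fails with a parse error if the indentation or structure is invalid
python3 -c "import yaml; yaml.safe_load(open('.ebextensions/options.config')); print('YAML OK')"
```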
On the EC2 instance we will create a mount point under a new /data directory, which is less likely to interfere with any other process.
Let’s get started:
Create a new project folder called beanstalk-with-ebs-two-volumes.
Inside this folder create a Dockerrun.aws.json with the following content:
"AWSEBDockerrunVersion": 2, "volumes": [ "name": "kettle", "host": "sourcePath": "/data/kettle" , "name": "pentaho", "host": "sourcePath": "/data/pentaho" ], "containerDefinitions": [ "name": "webSpoon", "image": "hiromuhota/webspoon:latest-full", "essential": true, "memory": 1920, "environment": [ "name": "JAVA_OPTS", "value": "-Xms1024m -Xmx1920m" ], "portMappings": [ "hostPort": 80, "containerPort": 8080 ], "mountPoints": [ "sourceVolume": "kettle", "containerPath": "/root/.kettle", "readOnly": false , "sourceVolume": "pentaho", "containerPath": "/root/.pentaho", "readOnly": false , "sourceVolume": "awseb-logs-webSpoon", "containerPath": "/usr/local/tomcat/logs" ] ] Inside your project directory (beanstalk-with-ebs-two-volumes), create a subdirectory called .ebextensions.
In the .ebextensions directory create a new file called options.config and populate it with this content:
```yaml
commands:
  01mkfs:
    command: "mkfs -t ext3 /dev/sdh"
  02mkdir:
    command: "mkdir -p /data/kettle"
  03mount:
    command: "mount /dev/sdh /data/kettle"
  04mkfs:
    command: "mkfs -t ext3 /dev/sdi"
  05mkdir:
    command: "mkdir -p /data/pentaho"
  06mount:
    command: "mount /dev/sdi /data/pentaho"

option_settings:
  - namespace: aws:autoscaling:launchconfiguration
    option_name: BlockDeviceMappings
    value: /dev/sdh=:1,/dev/sdi=:1
```

These instructions basically format our two external volumes and then mount them. Note that at the very end, under option_settings, we specify that each EBS volume should be 1 GB in size (very likely quite a bit too much for the pentaho volume, but this is the minimum you can define).
Now we have to zip our files from within the project root directory:
```
$ zip ../webspoon-with-ebs.zip -r * .[^.]*
  adding: Dockerrun.aws.json (deflated 65%)
  adding: .ebextensions/ (stored 0%)
  adding: .ebextensions/options.config (deflated 43%)
```

Note: The zip file will be conveniently placed outside the project directory.
Next via the Web UI create a new Beanstalk environment. The approach is the same as before, just that instead of the Dockerrun.aws.json you have to upload the zip file now.
Important: You have to create the new Beanstalk environment in exactly the same Availability Zone within your Region as your EBS Drive resides in! Otherwise you can’t connect it! You can define the Availability Zone in the Capacity settings on the Configure env name page.
Once the environment is running, ssh into the EC2 instance.
Note: You can find the public DNS of your EC2 instance via the EC2 panel of the AWS console.
```bash
ssh -i <path-to-pem-file> ec2-user@<ec2-instance-public-dns>
```

We can check the mount points now:
```
$ ls -l /data
total 8
drwxr-xr-x 3 root root 4096 Dec 26 16:12 kettle
drwxr-xr-x 4 root root 4096 Dec 26 16:11 pentaho

$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda    202:0    0   8G  0 disk
└─xvda1 202:1    0   8G  0 part /
xvdh    202:112  0   1G  0 disk /data/kettle
xvdi    202:128  0   1G  0 disk /data/pentaho
```

As you can see, now everything looks fine.
Once you have the new configuration running, you might want to check if the new volumes got created: You can do this by going to the EC2 section of the AWS console. On the side panel under Elastic Block Storage click on Volumes:
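If you prefer the command line, a check along these lines should list the volumes attached to your instance (assumes a configured AWS CLI; the instance ID is a placeholder):

```bash
# Replace the instance ID with the one shown in the EC2 console
aws ec2 describe-volumes \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'Volumes[].{ID:VolumeId,Size:Size,Device:Attachments[0].Device,State:State}' \
  --output table
```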

Note: Elastic Beanstalk will by default also create a volume /dev/xvdcz to store the Docker images.
Actually, what we did was a bit too complex (but might be required in some scenarios): We could have simply just mapped the /root folder of the Docker container to the /data folder on the EC2 instance and created a mount point /data that links the EC2 instance directory to the EBS volume. This way all the data is contained in one drive. Well, as it turns out, this is actually not a good idea, as we get loads of other files/folders as well:
```
$ ls -la /data
total 40
drwxr-xr-x  7 root root  4096 Dec 26 21:36 .
drwxr-xr-x 26 root root  4096 Dec 26 21:24 ..
drwxr-x---  3 root root  4096 Dec 26 21:36 .java
drwxr-x---  3 root root  4096 Dec 26 21:37 .kettle
drwx------  2 root root 16384 Dec 26 21:24 lost+found
drwxr-x---  3 root root  4096 Dec 26 21:26 .m2
drwxr-x---  3 root root  4096 Dec 26 21:36 .pentaho
```

Ok, so instead of this, we can leave the original Docker container to EC2 instance volume mapping in place:
| Docker container path | EC2 instance volume path |
|---|---|
| /root/.kettle | /data/kettle |
| /root/.pentaho | /data/pentaho |
And just use one EBS volume, which we mount to /data.

Create a new project directory called beanstalk-with-ebs-one-volume.
Add a new Dockerrun.aws.json file to this folder, which looks like this (it’s exactly the same as when we added the volumes originally):
"AWSEBDockerrunVersion": 2, "volumes": [ "name": "kettle", "host": "sourcePath": "/data/kettle" , "name": "pentaho", "host": "sourcePath": "/data/pentaho" ], "containerDefinitions": [ "name": "webSpoon", "image": "hiromuhota/webspoon:latest-full", "essential": true, "memory": 1920, "environment": [ "name": "JAVA_OPTS", "value": "-Xms1024m -Xmx1920m" ], "portMappings": [ "hostPort": 80, "containerPort": 8080 ], "mountPoints": [ "sourceVolume": "kettle", "containerPath": "/root/.kettle", "readOnly": false , "sourceVolume": "pentaho", "containerPath": "/root/.pentaho", "readOnly": false , "sourceVolume": "awseb-logs-webSpoon", "containerPath": "/usr/local/tomcat/logs" ] ] Inside your project directory (beanstalk-with-ebs-one-volume), create a subdirectory called .ebextensions. In the .ebextensions directory create a new file called options.config and populate it with this content:
```yaml
commands:
  01mkfs:
    command: "mkfs -t ext3 /dev/sdh"
  02mkdir:
    command: "mkdir -p /data"
  03mount:
    command: "mount /dev/sdh /data"

option_settings:
  - namespace: aws:autoscaling:launchconfiguration
    option_name: BlockDeviceMappings
    value: /dev/sdh=:1
```

Now we have to zip our files from within the project root directory:
```
$ zip ../webspoon-with-ebs.zip -r * .[^.]*
  adding: Dockerrun.aws.json (deflated 65%)
  adding: .ebextensions/ (stored 0%)
  adding: .ebextensions/options.config (deflated 43%)
```

Note: The zip file will be conveniently placed outside the project directory.
Next via the Web UI create a new Beanstalk environment. The approach is the same as before, just that instead of the Dockerrun.aws.json you have to upload the zip file now.
Important: You have to create the new Beanstalk environment in exactly the same Availability Zone within your Region as your EBS Drive resides in! Otherwise you can’t connect it! You can define the Availability Zone in the Capacity settings on the Configure env name page.
When we ssh into our EC2 instance, we can see:
```
$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda    202:0    0   8G  0 disk
└─xvda1 202:1    0   8G  0 part /
xvdh    202:112  0   1G  0 disk /data

$ ls -l /data/
total 24
drwxr-xr-x 3 root root  4096 Dec 26 22:33 kettle
drwx------ 2 root root 16384 Dec 26 22:05 lost+found
drwxr-xr-x 3 root root  4096 Dec 26 22:33 pentaho
```

As you can see, our /data directory looks way tidier now.
If we had specified /var/app/current/kettle and /var/app/current/pentaho as mount points we would have run into problems. Everything specified in .ebextensions gets executed before anything in Dockerrun.aws.json. So this approach would have mounted the EBS volumes first under /var/app/current and then later on when Dockerrun.aws.json would have tried to deploy our project, it would have seen that the /var/app/current already exists. In this case it would have renamed it to /var/app/current.old and deployed the app to a fresh new /var/app/current directory.
You can see this when you run the lsblk command to check how the devices were mounted:
```
$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda    202:0    0   8G  0 disk
└─xvda1 202:1    0   8G  0 part /
xvdh    202:112  0   1G  0 disk /var/app/current.old/kettle
```

Conclusion: That’s why we need a different mount point! We want to specify a custom location that does not interfere with any other process.
Note: Here we only discuss a simple manual approach. This is only sensible if you run a single EC2 node with WebSpoon on it. For a more complex setup with load balancer and auto-scaling an automatic solution should be put in place.
So what is the point of this exercise really? Why did we do this? Our main intention was to have some form of persistent storage. Still, if we were to terminate the Beanstalk environment now, all our EBS volumes would disappear as well! However, via the EC2 panel under Elastic Block Storage there is a way to detach the volume:

The normal Detach Volume command might not work, because the volume is still used by our EC2 instance. You can, however, choose the Force Detach Volume command, which should succeed. Wait until the state of the drive shows available.
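If you prefer the AWS CLI for this step, a rough equivalent would be (the volume ID is a placeholder):

```bash
# Force-detach the volume, then poll until its state shows "available"
aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[0].State' --output text
```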
Next terminate your current Beanstalk environment. Once it’s terminated, you will see that your EBS Drive is still around. Start a new Beanstalk environment (this time just use the Dockerrun.aws.json file from this section, not the whole zip file - we do not want to create a new EBS drive).
Important: You have to create the new Beanstalk environment in exactly the same Availability Zone within your Region as your EBS Drive resides in! Otherwise you can’t connect it! You can define the Availability Zone in the Capacity settings on the Configure env name page.
Next, on the EC2 page in the AWS Console go to the EBS Volumes, mark our old EBS drive and right-click: choose Attach Volume. In a pop-up window you will be able to define to which instance you want to attach the EBS drive.
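Again, a hedged CLI equivalent (volume ID, instance ID and device name are placeholders):

```bash
# Attach the existing volume to the new EC2 instance as /dev/sdf
aws ec2 attach-volume \
  --volume-id vol-0123456789abcdef0 \
  --instance-id i-0123456789abcdef0 \
  --device /dev/sdf
```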
Once it is attached, grab the public URL of your EC2 Instance from the EC2 area of the AWS console (click on Instances in the side panel). Ssh into your EC2 instance and then run the following:
```
$ lsblk
NAME    MAJ:MIN   RM SIZE RO TYPE MOUNTPOINT
xvda    202:0      0   8G  0 disk
└─xvda1 202:1      0   8G  0 part /
xvdf    202:80     0   1G  0 disk
xvdcz   202:26368  0  12G  0 disk
```

Originally I asked the EBS drive to be named sdf, but due to OS specifics it ended up being called xvdf, as we can see from running the lsblk command. Note that the last letter remains the same. Also, you can see that it doesn’t have a mount point yet. So now we want to mount the EBS drive to the /data directory:
```
$ sudo mount /dev/xvdf /data
```

Next you have to restart the Docker container so that the changes can be picked up.
```
$ sudo docker ps
CONTAINER ID  IMAGE                            COMMAND            CREATED         STATUS         PORTS                 NAMES
18dfb174b88a  hiromuhota/webspoon:latest-full  "catalina.sh run"  15 minutes ago  Up 15 minutes  0.0.0.0:80->8080/tcp  ecs-awseb-Webspoon-env-xxx
0761525dd370  amazon/amazon-ecs-agent:latest   "/agent"           16 minutes ago  Up 16 minutes                        ecs-agent

$ sudo docker restart 18dfb174b88a
```

Note: If you create the EBS Drive upfront, separately from the Beanstalk environment, the EBS Drive does not get terminated when you later shut down the environment.
Note: This approach is only really sensible if you were to run one EC2 instance only. The beauty of the previous approach is that every EC2 instance spun up via the auto-scaling process will have exactly the same setup (e.g. one EC2 instance with one EBS drive). So for the approach outlined below, you do not need a load balancer or auto-scaling.
If you want to go down the manual road, you can also create the EBS Drive manually (via the EC2 console under Elastic Block Storage > Volumes, in the same Availability Zone your environment will run in):
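If you would rather script this step, a minimal sketch with the AWS CLI could look like this (the Availability Zone is a placeholder and must match the zone your Beanstalk environment will run in):

```bash
# Create a 1 GB general purpose volume in the target availability zone
aws ec2 create-volume --availability-zone eu-west-1a --size 1 --volume-type gp2
```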
Next, set up a new Beanstalk environment using the Dockerrun.aws.json file from the beanstalk-with-ebs-one-volume project (if you skipped the previous sections, the instructions for setting up the Beanstalk environment are near the beginning of this article). This time also edit the Capacity settings on the Configure env name page: for the Availability Zone, define the same zone as your EBS Drive resides in.
```
$ lsblk
NAME    MAJ:MIN   RM SIZE RO TYPE MOUNTPOINT
xvda    202:0      0   8G  0 disk
└─xvda1 202:1      0   8G  0 part /
xvdf    202:80     0   1G  0 disk
xvdcz   202:26368  0  12G  0 disk
```

Originally I asked the EBS drive to be named sdf, but due to OS specifics it ended up being called xvdf, as we can see from running the lsblk command. Note that the last letter remains the same. Also, you can see that it doesn’t have a mount point yet.
Since it is a fresh new EBS drive, we have to format it first:
```
$ sudo mkfs -t ext3 /dev/xvdf
```

Next we want to mount the EBS drive to the /data directory:
```
$ sudo mount /dev/xvdf /data
```

Next you have to restart the Docker container so that the files can be picked up:
```
$ sudo docker ps
CONTAINER ID  IMAGE                            COMMAND            CREATED         STATUS         PORTS                 NAMES
18dfb174b88a  hiromuhota/webspoon:latest-full  "catalina.sh run"  15 minutes ago  Up 15 minutes  0.0.0.0:80->8080/tcp  ecs-awseb-Webspoon-env-xxx
0761525dd370  amazon/amazon-ecs-agent:latest   "/agent"           16 minutes ago  Up 16 minutes                        ecs-agent

$ sudo docker restart 18dfb174b88a
```

Once you terminate your environment, the EBS Drive will still be available, so you can later on easily attach it to a new EC2 instance. It behaves this way because you created the EBS Drive separately from the Beanstalk environment.
If you don’t want to use WebSpoon or the EBS drive for some time, you can take a snapshot of the data on the EBS Drive and store the snapshot on S3. Then you can get rid of the EBS Drive. Whenever you decide to get WebSpoon running again, you can restore the data from the snapshot onto an EBS Drive and attach it to the EC2 instance that is running WebSpoon, and all is back to normal again.
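A rough CLI sketch of that snapshot round trip (all IDs and the Availability Zone are placeholders):

```bash
# Snapshot the volume before deleting it
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
  --description "webSpoon kettle/pentaho data"
# Later: restore the snapshot into a fresh volume in the desired availability zone
aws ec2 create-volume --availability-zone eu-west-1a \
  --snapshot-id snap-0123456789abcdef0
```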
Note: An EFS Volume is a network file storage (available in all AWS regions that support it) that can be shared between many EC2 instances at the same time. Because it is a network storage, it will be a bit slower than an EBS Volume (which is a device directly attached to an EC2 instance). Another advantage of an EFS Volume is that it grows or shrinks automatically according to your storage needs.
“You can create an Amazon EFS volume as part of your environment, or mount an Amazon EFS volume that you created independently of Elastic Beanstalk.”
Important: Before you start make sure that EFS is available in your region! If not, change the region selector in the top right hand corner of the AWS console.
Note: This approach is only really suitable for smaller setups.
Go to the EFS Console and click the Create file system button. Provide the relevant details in the wizard dialog. This is really quite easy. Your network drive should be ready in a matter of minutes.
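If you would rather script the EFS setup, a sketch could look like this (the subnet and security group IDs are placeholders, and the security group must allow inbound NFS, port 2049, from your EC2 instances):

```bash
# Create the file system (the creation token is just an idempotency string)
aws efs create-file-system --creation-token webspoon-data
# Create a mount target in the subnet your Beanstalk EC2 instances use
aws efs create-mount-target \
  --file-system-id fs-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0 \
  --security-groups sg-0123456789abcdef0
```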
Next you can either ssh into your EC2 instances and run a few commands to mount the EFS Drive or add an additional config file to the beanstalk config files (and there are a few other options available as well). We will go with the first option for now.
Follow these steps:
Set up a new Beanstalk environment using the Dockerrun.aws.json file from the beanstalk-with-ebs-one-volume project.

Ssh into the EC2 instance and install the NFS client utilities with sudo yum install -y nfs-utils. The mount point (/data) already exists, hence we can directly mount the EFS Volume like so:

```bash
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 fs-9c912d55.efs.eu-west-1.amazonaws.com:/ /data
```

If you run the ls command on the /data directory you will see that it is empty now. Restart the Docker Container now so that these changes can be picked up.
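Keep in mind that a manual mount does not survive a reboot. A minimal sketch for making it persistent (assuming the same EFS DNS name as above) is to append an entry to /etc/fstab:

```bash
# _netdev delays the mount until networking is up
echo "fs-9c912d55.efs.eu-west-1.amazonaws.com:/ /data nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,_netdev 0 0" | sudo tee -a /etc/fstab
```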