To run very large FDTD jobs it is often useful to distribute them across several servers to either speed up the compute time or access more RAM than is available on a single server. Amazon EC2 is a convenient way to buy compute time on-demand and makes it possible to access multiple large servers and only pay for the time you use.
This can greatly speed up FDTD jobs because several servers work on the simulation at once, while the total cost is often about the same as running on one server for a longer period of time.
This article shows a basic approach to configuring your AWS account to run distributed FDTD jobs on EC2. It runs the FDTD engine on Amazon Linux, which is cost-effective for compute jobs, with GUI access available via X11.
The simulation files are stored in S3 to transfer back and forth to your PC or laptop for interactive analysis of the results. All of the interaction with AWS is done in a web browser through the AWS Management Console.
There are many different approaches to running on EC2; this article is best suited to the following scenarios:
- Running very large FDTD jobs that won't fit on a single compute node
- A single user or a small team of users sharing the same AWS account
- Occasional to moderate use of EC2
- A few simulation jobs per day
- Low-cost access to computing time using spot instances
This may not be the best approach for a large team, for running small FDTD jobs, or for parameter sweeps or inverse design. Please contact Lumerical support for other use case scenarios or configurations.
Prerequisites:
- Experience with the AWS console
- Experience working with SSH (X11 forwarding) and SCP
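For example, connecting to an instance with X11 forwarding and copying a file with scp might look like the following. The key file name, IP address, and file name are placeholders for your own values; Amazon Linux uses the `ec2-user` account.

```shell
# Connect with X11 forwarding (-X) so GUI tools can display locally
# (key path and IP address are placeholders for your own values)
ssh -X -i ~/.ssh/my-keypair.pem ec2-user@203.0.113.10

# Copy a file to the instance with scp using the same key
scp -i ~/.ssh/my-keypair.pem ./simulation.fsp ec2-user@203.0.113.10:~/
```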
AWS charges are based on the amount of compute time and storage used. The majority of costs incurred when running Lumerical products on AWS, while following our most common use cases, are:
- EC2 (Elastic Compute Cloud) instances: For EC2 you pay only for what you use, billed by the second (while the instance is in the "Running" state). Our use cases use on-demand instances; however, cheaper rates are available for longer-term contracts and spot instances. See Amazon EC2 pricing for details.
- EBS (Elastic Block Store): These volumes are used by each EC2 instance as a system drive and are a fixed size set when the instance is launched. Charges are incurred whether the instance is in the "Running" or "Stopped" state. See Amazon EBS pricing for details.
- EFS (Elastic File System): These volumes are dynamically sized and you only pay for the storage you use. These volumes are used as a shared filesystem between multiple instances. See Amazon EFS pricing for details.
Example on-demand estimate while testing (USD):
- License server (t3.nano, 10 GB storage): $5/month
- Compute instances (2 x c5.large, 20 GB storage each): $128/month, $0.18/hour
- Shared storage (EFS, 5 GB usage): $5/month
Example on-demand estimate for production (USD):
- License server (t3.nano, 10 GB storage): $5/month
- Compute instances (4 x c5n.18xlarge, 20 GB storage each): $11,360/month, $16/hour
- Shared storage (EFS, 1 TB usage): $300/month
1. Cluster Configuration
- Create and configure your VPC and Security Group
- Create a key-pair and IAM Role
- Create and Configure a License Server
- Create a Linux Compute AMI for use with a cluster
1.2 Create a placement group
Placement groups tell EC2 to place servers physically close to one another, ensuring good network performance between them. This is important for distributed FDTD jobs.
- Placement groups are added under the EC2 service and only require a name.
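If you have the AWS CLI installed and configured, the same placement group can also be created from the command line (the group name here is just an example):

```shell
# Create a "cluster" placement group so instances land close together
aws ec2 create-placement-group \
  --group-name fdtd-cluster-pg \
  --strategy cluster
```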
1.4 Create a launch template
A launch template saves all the settings for launching server instances, making launches faster and repeatable. We’ll create a launch template that launches our compute AMI in our placement group using our IAM role.
- Go to INSTANCES > Launch Templates on the EC2 dashboard
- Create launch template
- Provide the required information:
- template name
- template version description
- AMI ID
- instance type (e.g. c5 instances)
- keypair name
- security group
- edit Advanced Details
- shutdown behavior = Stop
- Placement group name
- Create launch template
- The Launch template will be used when creating more nodes as required by your simulation project.
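As a sketch, an equivalent launch template can also be created from the AWS CLI. All IDs and names below are placeholders for your own values:

```shell
# Create a launch template capturing the settings listed above
# (AMI ID, key name, security group, and placement group are placeholders)
aws ec2 create-launch-template \
  --launch-template-name fdtd-compute \
  --version-description "initial version" \
  --launch-template-data '{
    "ImageId": "ami-0123456789abcdef0",
    "InstanceType": "c5.large",
    "KeyName": "my-keypair",
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
    "InstanceInitiatedShutdownBehavior": "stop",
    "Placement": {"GroupName": "fdtd-cluster-pg"}
  }'
```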
2. Running a job
2.1 Upload an .fsp file to your bucket
The job launch script will download your file from S3. Make a different folder for each file. Lumerical FDTD can upload files directly to S3 if you configure your PC.
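If you have the AWS CLI installed on your PC, the upload can also be done from the command line rather than the console (the bucket and folder names here are examples):

```shell
# Upload the project file into its own folder in the bucket
aws s3 cp ./my-sim.fsp s3://my-fdtd-bucket/job-001/my-sim.fsp
```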
2.2 Launch instances from template
We’ll use the launch template we created to launch 2 instances. With this technique, it is simple to launch as many instances as you like (or your account limits permit). We have already specified all the details to launch instances, so we just pick the template and set the number of instances. You can override the instance type in the template if you want to try different instances to optimize for your job. In the video, we chose a smaller instance. Typically the c5n.18xlarge is a good default choice.
- Go to INSTANCES > Launch Templates on the EC2 dashboard.
- Select the Launch template you created
- Click on Actions and select "Launch instance from template"
- Select the template version (Default if this is new and there is only one version)
- Indicate how many instances (nodes) you want to create.
- Check and confirm that instance details and advanced details are correct.
e.g. AMI ID, type, subnet, security groups, key pair, storage, shutdown behavior, and placement group name.
- Press "Launch instance from template"
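The same launch can be scripted with the AWS CLI (the template name is an example; `--count` sets the number of nodes):

```shell
# Launch 2 compute nodes from the saved template's default version
aws ec2 run-instances \
  --launch-template LaunchTemplateName=fdtd-compute,Version='$Default' \
  --count 2
```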
2.3 Login and launch a job
Once the instances are up we’ll connect with ssh to one of them. It doesn’t matter which one, but it’s a good idea to note which one you choose: this will be the root node if you want to log in again to terminate the job early or check on progress.
The Lumerical AMI has a script 'aws-fdtd.py' that we use to launch the job.
- Downloads your project file. You only have to give it the folder path as the command line argument. You’ve already configured the S3 bucket and permissions. It’ll throw an error if there is more than one file in the bucket.
- Creates a temp folder to store the job file and outputs
- Discovers the other servers you have launched and collects all the IP addresses. It will find all the servers of the same type in the same placement group. If you want to run multiple jobs at the same time you’ll need to create more placement groups and set those when you launch the instances.
- Constructs the command to run the FDTD job and runs it.
- When the job is finished, uploads all of the output files to a folder under the same S3 path
- Terminates all of the instances used in the job
- This approach is intended for on-demand usage where you start EC2 resources for each big job and terminate them when the job completes. You can ssh into the root node and monitor the log file for progress if it is a very long job.
- You can terminate a job early by sending the kill signal to mpiexec. The job will save all the results and exit cleanly.
kill `pidof mpiexec`
- You can also choose to manually configure your cluster and run jobs.
- You will need to make note of the IP addresses of the launched nodes and use them in the product's Resource Configuration, or on the CLI. You will need to manually terminate your compute nodes when you are done.
# The first node in the host list must have local access to the .fsp
/opt/intel/impi/2018.4.274/bin64/mpiexec -host 10.0.0.132,10.0.0.50 -perhost 2 /opt/lumerical/2019b/bin/fdtd-engine-impi-lcl ~/test.fsp
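The `-host` argument in the command above is just the nodes' private IP addresses joined with commas. A minimal sketch of assembling it from addresses you noted down (the IPs are placeholders):

```shell
# Private IPs noted from the EC2 console for the launched nodes (placeholders)
NODE_IPS="10.0.0.132 10.0.0.50"

# Join them into the comma-separated list mpiexec expects
HOSTLIST=$(echo "$NODE_IPS" | tr ' ' ',')
echo "$HOSTLIST"   # prints 10.0.0.132,10.0.0.50
```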
2.4 Download results
Once the instances terminate (you will be forcefully logged out from ssh), you can find all the output files in your S3 bucket.
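The results can be pulled down with the AWS CLI as well as through the console (the bucket and folder names are examples):

```shell
# Download the job's output folder from S3 to the current directory
aws s3 cp s3://my-fdtd-bucket/job-001/ ./job-001/ --recursive
```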
3. Spot instances
Spot instances allow you to bid for compute time at a much lower rate than on-demand. The trade-off is that the instances can be terminated at any time if there is insufficient capacity or you are outbid. You can use spot instances with this approach: simply check the spot instance option when you launch your instances from a template.
If your spot instances are marked for termination, EC2 issues a 2-minute warning; the Lumerical AMI runs a background job that watches for this warning and stops the job early so it can upload the incomplete results to S3.
Using spot instances this way carries the risk of wasting money if a very large job is terminated too early and the results are unusable, but it can save a lot of money if you can tolerate occasionally losing and redoing a job.
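A rough sketch of the kind of watcher such a background job implements: EC2 exposes the termination notice through the instance metadata service, and the job can be stopped when it appears. This only works on an actual EC2 spot instance, and the exact script shipped in the AMI may differ.

```shell
# Poll the instance metadata service for a spot termination notice;
# /spot/instance-action returns 404 until a termination is scheduled
while true; do
  CODE=$(curl -s -o /dev/null -w '%{http_code}' \
    http://169.254.169.254/latest/meta-data/spot/instance-action)
  if [ "$CODE" = "200" ]; then
    # Stop the FDTD job cleanly so partial results can be uploaded
    kill "$(pidof mpiexec)"
    break
  fi
  sleep 5
done
```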