A key aspect of your high performance computing (HPC) migration is the configuration of job schedulers. Job schedulers are responsible for scheduling user jobs, that is, determining where and when jobs run. In the cloud, job schedulers interact with resource orchestrators to acquire and release resources on demand, unlike an on-premises environment, where resources are fixed and always fully available. This part of the guide covers the needs, tools, services, and best practices associated with your job schedulers.
The most common HPC job schedulers are Slurm, OpenPBS, PBSPro, and LSF.
Define job scheduler needs
Scheduler deployment:
- Migrate existing job scheduler configurations to the cloud environment.
- Utilize the same scheduler available within CycleCloud or migrate to a different scheduler if necessary.
Configuration management:
- Configure job schedulers to define partitions/queues, Azure SKUs, compute node hostnames, and other parameters (see the configuration sketch after this list).
- Automatically update scheduler configurations on the fly as resource availability and job requirements change.
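For illustration, a Slurm partition mapped to an Azure VM size might look like the following slurm.conf excerpt. The node names, partition name, and the CPU and memory figures (sized here for an HBv3-style SKU) are hypothetical:
# slurm.conf excerpt (hypothetical node names and sizes)
# State=CLOUD marks nodes that are provisioned on demand by the orchestrator
NodeName=hpc-[1-10] CPUs=120 RealMemory=448000 State=CLOUD
PartitionName=hpc Nodes=hpc-[1-10] Default=YES MaxTime=INFINITE State=UP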
Job submission and management:
- Allow end users to submit jobs for execution according to scheduling and resource access policy rules (a policy sketch follows this list).
- Monitor and manage job queues, resource utilization, and job statuses.
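For example, with Slurm, access policies can be enforced through the accounting layer. The following sketch assumes slurmdbd is configured; the account name, user name, and limits are hypothetical:
#!/bin/bash
# Hypothetical sketch: enforce per-user submission limits with Slurm accounting
sacctmgr --immediate add account name=proj_a Description="Project A"
sacctmgr --immediate add user name=alice Account=proj_a
sacctmgr --immediate modify user where name=alice set MaxJobs=20 MaxSubmitJobs=100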
Best practices for job schedulers in HPC lift and shift architecture
Efficient scheduler deployment:
- Plan and test the migration of existing job scheduler configurations to the cloud environment to ensure compatibility, performance, and user experience.
- Use CycleCloud's built-in support for schedulers like Slurm, OpenPBS, PBSPro, and LSF for a smoother deployment process (see the CLI sketch after this list).
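For instance, a cluster based on one of CycleCloud's scheduler templates can be imported and started from the CycleCloud CLI. The cluster and template file names here are hypothetical:
#!/bin/bash
# Hypothetical sketch: stand up a Slurm cluster from a CycleCloud template
cyclecloud import_cluster my-slurm-cluster -c Slurm -f slurm-template.txt
cyclecloud start_cluster my-slurm-cluster
cyclecloud show_cluster my-slurm-cluster   # verify cluster and node state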
Optimized configuration management:
- To align with changing resource availability and job requirements, regularly update scheduler configurations (for example, scheduler queues/partitions).
- Automate configuration changes by using scripts and tools to minimize manual intervention and reduce the risk of errors, as in the sketch after this list.
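A minimal sketch, assuming Slurm and a staged configuration file (the file paths are placeholders):
#!/bin/bash
# Hypothetical sketch: roll out an updated slurm.conf and reload the scheduler
set -euo pipefail
sudo cp /tmp/slurm.conf.new /etc/slurm/slurm.conf   # staged configuration (placeholder path)
sudo scontrol reconfigure                           # re-read slurm.conf without restarting slurmctld
sinfo                                               # confirm partitions and node states look correct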
Robust job submission and management:
- Implement a user-friendly interface for job submission and management to facilitate end-user interaction with the scheduler.
- To identify and address potential issues promptly, continuously monitor job queues, resource utilization, and job statuses (see the sketch after this list).
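With Slurm, for example, a periodic health check can combine the standard reporting tools; the format strings are illustrative:
#!/bin/bash
# Illustrative sketch: snapshot queue, node, and recent job status in Slurm
squeue --format="%.10i %.12j %.8u %.8T %.10M %r"    # pending and running jobs with state and reason
sinfo --summarize                                   # per-partition node availability
sacct --starttime today --format=JobID,JobName,State,Elapsed,ExitCode   # today's job accounting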
Scalability and performance:
- Configure dynamic scaling policies to automatically adjust the number of compute nodes based on job demand, optimizing resource utilization and cost (see the excerpt after this list).
- Use performance metrics and monitoring tools to continuously assess and improve the performance of the job scheduler and the overall HPC environment.
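In Slurm, for example, elastic scaling is driven by the cloud scheduling parameters in slurm.conf. The program paths and timings below are hypothetical; in a CycleCloud deployment, the suspend and resume hooks are supplied by the CycleCloud Slurm integration:
# slurm.conf excerpt (hypothetical paths and timings)
SuspendProgram=/opt/azurehpc/slurm/suspend.sh   # deallocates idle cloud nodes
ResumeProgram=/opt/azurehpc/slurm/resume.sh     # provisions nodes when jobs queue up
SuspendTime=300                                 # seconds a node can idle before suspension
ResumeTimeout=1800                              # seconds allowed for a node to boot and join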
These best practices help ensure a smooth transition to cloud-based job scheduling, maintaining efficiency, scalability, and performance for HPC workloads.
Example job scheduler submission
Submit a Slurm interactive job by using srun (my_application is a placeholder for your executable):
#!/bin/bash
# Submit a job using srun
srun --partition=debug --ntasks=1 --time=00:10:00 --job-name=test_job --output=output.txt my_application
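Because srun runs in the foreground, the command blocks until the job completes, and the job's output is written to output.txt.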
Submit a Slurm batch script by using sbatch:
#!/bin/bash
# Create a Slurm batch script by using a heredoc
cat > job_script.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=debug
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --job-name=test_job
#SBATCH --output=output.txt

# Run the application
my_application
EOF

# Submit the batch job
sbatch job_script.sh
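sbatch prints the new job ID and returns immediately; the job's progress can then be tracked with squeue, and its output reviewed in output.txt once it completes.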