# OrangeGrid Cluster Specifications

Technical specifications for the OrangeGrid High-Throughput Computing (HTC) cluster.
## Overview

- Scheduler: HTCondor
- Type: High-Throughput Computing (HTC)
- Best for: Many independent jobs, parameter sweeps, batch processing, embarrassingly parallel workloads
- Access: SSH via `its-og-loginX.syr.edu` (X = your assigned login node number, specified in your welcome email)
## Cluster Architecture

OrangeGrid is a heterogeneous pool of compute resources designed for high-throughput workloads. Unlike traditional HPC clusters with uniform nodes, OrangeGrid consists of many different types of machines, allowing the scheduler to match jobs to available resources dynamically.
### Key Characteristics

- Opportunistic scheduling - Jobs run when resources become available
- Fair-share allocation - Resources distributed equitably among all users
- Heterogeneous pool - Variety of CPU and GPU configurations
- Best for independent jobs - Each job runs separately without inter-job communication
## Requesting Resources

### Basic Job Submission

HTCondor uses submit files to define your job:

```
executable = my_script.sh

# Request resources
request_cpus = 1
request_memory = 4GB
request_disk = 10GB

# Optional: request a GPU
request_gpus = 1

# Output files
output = job.$(cluster).$(process).out
error = job.$(cluster).$(process).err
log = job.$(cluster).log

queue 1
```
Note: Your home directory (/home/netid/) is automatically mounted on compute nodes, so your scripts and data are already accessible - no file transfer needed!
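If you generate many submit files, it can help to build them programmatically. The following is a minimal sketch (a hypothetical helper, not part of any OrangeGrid tooling) that renders the submit description shown above from Python:

```python
def make_submit_file(executable, cpus=1, memory="4GB", disk="10GB", gpus=0):
    """Render a minimal HTCondor submit description as a string."""
    lines = [
        f"executable = {executable}",
        f"request_cpus = {cpus}",
        f"request_memory = {memory}",
        f"request_disk = {disk}",
    ]
    if gpus:
        lines.append(f"request_gpus = {gpus}")
    lines += [
        "output = job.$(cluster).$(process).out",
        "error = job.$(cluster).$(process).err",
        "log = job.$(cluster).log",
        "queue 1",
    ]
    return "\n".join(lines) + "\n"

print(make_submit_file("my_script.sh"))
```

Write the result to a file (e.g. `my_job.sub`) and submit it with `condor_submit my_job.sub`.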
### CPU Requests

```
# Single CPU (default)
request_cpus = 1

# Multiple CPUs for multi-threaded applications
request_cpus = 4

# Note: Each job runs on a single node
# For truly parallel work across nodes, use Zest with MPI
```
### Memory Requests

```
# Specify in MB or GB
request_memory = 2GB
request_memory = 2048MB   # Same as 2GB

# Be accurate - jobs are killed if they exceed requested memory
# Under-requesting causes job failure
# Over-requesting means longer wait times
```
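Because jobs are killed over their request but over-requesting slows scheduling, a common rule of thumb is to request the observed peak plus modest headroom. A small sketch of that heuristic (the 20% headroom and 512 MB rounding are illustrative choices, not OrangeGrid policy):

```python
import math

def suggest_memory_request(peak_mb, headroom=0.20):
    """Suggest a request_memory value: observed peak usage plus
    headroom, rounded up to the next 512 MB."""
    needed = peak_mb * (1 + headroom)
    rounded = int(math.ceil(needed / 512.0)) * 512
    return f"{rounded}MB"

print(suggest_memory_request(1500))  # a 1500 MB peak -> "2048MB"
```

Feed it the MemoryUsage value reported by `condor_history` for a completed job of the same kind.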
### Disk Space

```
# Local disk space needed during job execution
request_disk = 5GB

# This is temporary space on the execution node
# Files transferred back based on transfer settings
```
## GPU Resources

### Available GPUs

OrangeGrid has GPU nodes with various models. The scheduler will match your job to an available GPU.

GPU models in the pool:

- NVIDIA A100
- NVIDIA L40S
- NVIDIA A6000
- Other models available
### Requesting GPUs

```
# Request any available GPU
request_gpus = 1

# Multiple GPUs (if your code supports it)
request_gpus = 2
```

Best Practice: Let HTCondor assign any available GPU unless your code specifically requires a certain model.
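If your code genuinely does require one model, a `requirements` expression can constrain matching. The GPU attribute name below (`GPUs_DeviceName`) depends on the HTCondor version and site configuration, so treat this as a sketch and verify the attribute with `condor_status -long` on a GPU node before relying on it:

```
# Hypothetical: pin to a specific GPU model
# (verify the exact attribute name and value for this pool)
request_gpus = 1
requirements = (GPUs_DeviceName == "NVIDIA A100")
```

Constraining the model shrinks the set of machines your job can match, so expect longer queue times.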
## Storage

### Home Directory

- Path: `/home/netid/`
- Type: NetApp storage
- Accessibility: Available on submit/login nodes and automatically mounted on compute nodes when your job runs
- Use for: Scripts, code, data, conda environments, results
### How Storage Works

Unlike some clusters, OrangeGrid mounts your home directory on compute nodes automatically. This means:

- ✅ Your scripts are already accessible when jobs run
- ✅ Your data files are available without transfer
- ✅ Results are saved directly to your home directory
- ✅ No need to specify file transfer in submit files

Best Practices:

- Organize with subdirectories: `~/data/`, `~/scripts/`, `~/results/`
- Output files are automatically saved to the directory you submit from (or to paths you specify)
- For very large datasets (multi-TB), contact researchcomputing@syr.edu to discuss optimal storage strategies
## Submitting Multiple Jobs

### Job Arrays

Submit many similar jobs with one command:

```
# process.py must start with a shebang line and be executable (chmod +x)
executable = process.py
arguments = input_$(Process).dat

request_cpus = 1
request_memory = 2GB

output = job_$(Process).out
error = job_$(Process).err
log = jobs.log

# Submit 100 jobs, numbered 0-99
queue 100
```
### Queue with Variables

Process multiple items:

```
executable = analyze.py
arguments = $(item)

# Queue one job for each item
queue item from (
    dataset1.csv
    dataset2.csv
    dataset3.csv
)
```

See the Multiple Jobs & Arrays Guide for more examples.
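For long item lists, HTCondor can also read items from a file (`queue item from <filename>`). A short sketch that generates such a file from Python (the `items.txt` name is just an example):

```python
from pathlib import Path

# Hypothetical item list; one entry per line
datasets = ["dataset1.csv", "dataset2.csv", "dataset3.csv"]
items = "\n".join(datasets) + "\n"

# The submit file can then replace the inline list with:
#   queue item from items.txt
Path("items.txt").write_text(items)
print(items)
```

This keeps the submit file unchanged while the item list is regenerated as your data grows.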
## Checking Cluster Status

```bash
# View your jobs
condor_q netid

# View all jobs in the pool
condor_q

# Detailed job information
condor_q -long <jobid>

# View available machines
condor_status

# View your priority and usage
condor_userprio

# Job history
condor_history netid

# Watch job status (updates every 5 seconds)
watch -n 5 condor_q netid
```
### Job States

When you run `condor_q`, you'll see jobs in different states:

| State | Meaning | Description |
|---|---|---|
| I | Idle | Job is waiting in the queue for resources |
| R | Running | Job is currently executing |
| H | Held | Job has a problem requiring manual intervention |
| C | Completed | Job finished and is ready to leave the queue |
| X | Removed | Job has been removed (e.g. with condor_rm) and is exiting |
### Dealing with Held Jobs

If a job enters the "H" (held) state:

```bash
# Check why the job is held
condor_q -hold <jobid>

# Common reasons:
# - Requested more memory than available
# - Executable not found
# - File transfer failed

# Release the held job after fixing the issue
condor_release <jobid>

# Or remove and resubmit
condor_rm <jobid>
```
## Common Job Patterns

### Basic Python Job

```
executable = /usr/bin/python3
arguments = my_script.py

request_cpus = 1
request_memory = 4GB

output = job.$(cluster).$(process).out
error = job.$(cluster).$(process).err
log = job.$(cluster).log

queue 1
```
### GPU Job

```
executable = /usr/bin/python3
arguments = train_model.py

request_cpus = 4
request_memory = 16GB
request_gpus = 1

output = gpu_job.$(cluster).$(process).out
error = gpu_job.$(cluster).$(process).err
log = gpu_job.$(cluster).log

queue 1
```
### Job with Conda Environment

Create a wrapper script (`run_with_conda.sh`) and make it executable (`chmod +x run_with_conda.sh`):

```bash
#!/bin/bash
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv
python my_analysis.py
```

Submit file:

```
executable = run_with_conda.sh

request_cpus = 1
request_memory = 4GB

output = conda_job.$(cluster).$(process).out
error = conda_job.$(cluster).$(process).err
log = conda_job.$(cluster).log

queue 1
```
### Parameter Sweep

```
# simulation.py must start with a shebang line and be executable (chmod +x)
executable = simulation.py
arguments = --param $(Process)

request_cpus = 1
request_memory = 2GB

output = sim_$(Process).out
error = sim_$(Process).err
log = simulations.log

# Run 1000 simulations with different parameters
queue 1000
```
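HTCondor only hands each job an integer index via `$(Process)`, so the script itself usually maps that index onto a parameter grid. A sketch of what a script like `simulation.py` might contain (the temperature/seed grid is purely illustrative):

```python
import itertools

# Hypothetical grid: 10 temperatures x 100 seeds = 1000 combinations,
# one combination per job index 0..999 supplied as $(Process)
temperatures = [round(0.1 * t, 1) for t in range(1, 11)]
seeds = range(100)
grid = list(itertools.product(temperatures, seeds))

def params_for(index):
    """Map a $(Process) index onto one (temperature, seed) pair."""
    temperature, seed = grid[index]
    return temperature, seed

print(params_for(0))    # first job's parameters
print(params_for(999))  # last job's parameters
```

With this pattern, `queue 1000` must match the grid size, so it helps to derive both from the same definition.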
## Important Notes

### Interactive Development Prohibited

OrangeGrid is for batch job submission only. Do not run:

- Jupyter notebooks
- IDEs (Spyder, VSCode, etc.)
- Interactive development tools
- Long-running tests on login nodes

For development, use Google Colab or your local machine.
### Job Duration

- Most jobs complete within hours to days
- No hard runtime limits, but fair-share scheduling applies
- Very long jobs may be preempted during high-demand periods
- For guaranteed long runtimes (weeks), consider Zest's longjobs partition
### Monitoring Resource Usage

After jobs complete, check resource usage to optimize future requests:

```bash
# View job statistics
condor_history -limit 1 -long <jobid> | grep -E 'Memory|Disk|Cpu'

# Useful attributes for tuning requests:
# - MemoryUsage: actual memory used (MB)
# - DiskUsage: actual disk used (KB)
# - RemoteUserCpu: CPU time used (seconds)
```

Adjust your request_memory and request_cpus based on actual usage to get jobs scheduled faster.
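The `-long` output is a series of ClassAd-style `Attribute = Value` lines, which is easy to post-process. A small sketch of extracting the usage attributes (the sample values below are made up for illustration):

```python
# Parse "Attribute = Value" lines as produced by condor_history -long.
# The sample text is illustrative, not real OrangeGrid output.
sample = """\
MemoryUsage = 3172
DiskUsage = 812500
RemoteUserCpu = 5403.0
"""

def parse_classad(text):
    """Extract numeric attributes from ClassAd-style 'Name = Value' lines."""
    usage = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        try:
            usage[name.strip()] = float(value.strip())
        except ValueError:
            pass  # skip non-numeric attributes
    return usage

usage = parse_classad(sample)
print(usage)
```

Run over the grep output above, this gives you the numbers to plug into your next submit file's resource requests.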
## External Resources

- HTCondor Quick Start Guide
- HTCondor User Manual
- HTCondor Submit File Reference
- OrangeGridExamples Repository
## Getting Help

Questions about OrangeGrid specifications or optimal job configuration?

📧 Email researchcomputing@syr.edu