System Admin IV (HPC)
**IAT Level II Certification Required. Candidates without the required certification will not be considered.**
In this role, candidates will create and maintain site reliability engineering (SRE) operations on multi-user High Performance Computing (HPC) systems using a variety of configuration management, IT monitoring, and automation tools within a Linux environment (RedHat, CentOS). Candidates will create a new Nagios Alerting Database and a new SRE Database, and develop an effective, consistent SRE automation protocol.
Candidates should preferably have a Bachelor’s degree in Computer Science or a related field and ten years of demonstrable experience in High Performance Computing systems administration and support of a large client-server-based IT enterprise.
Candidates will have experience with or exposure to automation tools including Puppet, Salt, Ansible, and Chef. Candidates shall also have experience with scripting in Bash, Python, and/or Perl.
Additionally, candidates will have experience with or exposure to XFS/ZFS File Systems and NFS/Block Storage FS Sharing; SSH, TMUX, PDSH, and CLUSH for system access; VI, EMACS, AWK/SED, and CRON for system editing; and Nagios, Ganglia, and SNMP for information technology monitoring.
Salt Lake City, UT or Annapolis Junction, MD
REQUIRES ACTIVE TS/SCI CLEARANCE WITH POLY