HPC Storage Engineer (Remote)

Posted 2 days ago United States Salary undisclosed

Job Description

RedLine Performance Solutions (RedLine) has been in the HPC solutions engineering services business for 21 years and is determined to keep the "bar of excellence" high for new hires. This enables RedLine to accomplish what other firms cannot and promotes a high level of staff retention. We offer services ranging from full life cycle HPC systems engineering to remote managed services to HPC program analysis. We are located in the Washington, DC area and are looking for an HPC Storage Engineer to join us for our NASA NACS High Performance Computing contract.

U.S. citizenship and the ability to obtain a Public Trust security clearance are mandatory requirements for this position. The position is located at the customer site in Greenbelt, MD. Strong preference for local candidates. We will consider primarily remote work, after an extensive initial onsite orientation period, for candidates who have significant experience in all of SAN, storage controllers, and high-performance filesystems. If the candidate works remotely, travel to Greenbelt, MD, will still be required on at least a quarterly basis.

This position will interact with the program manager, site lead, customer, and site staff, attending regularly scheduled customer meetings to keep stakeholders informed of activities and progress, and answer inquiries concerning all aspects of the program. An individual at this skill level should have demonstrated problem-solving ability in relevant areas of expertise, with technical publications and/or formal technical presentations, and should have some experience in mentoring and leading others in small team environments.

Duties and Responsibilities:

- Design (architect), implement, and troubleshoot large-scale (tens of petabytes) storage systems serving thousands of nodes. This includes developing technical drawings (including all required cables and connectivity to existing systems) and communicating with key stakeholders.
- Serve as a GPFS SME for the Discover HPC team as well as other teams running GPFS, both within and outside of the immediate organization.
- Develop and execute test plans for upgrading filesystems, and for isolating and resolving issues (in collaboration with vendors when beneficial).
- Resolve user-reported issues (e.g., filesystem, RDMA interconnect, kernel, operating system).
- In collaboration with the HPC team, evaluate and test proposed changes to the Discover cluster's production operating environment (e.g., OS patches and kernel parameter changes), and develop plans for OS upgrade and contingency downgrade.
- Maintain the storage aspects of the Discover Test and Development System (TDS), keeping it as close as reasonably possible to the production cluster configuration.

Requirements:

- Bachelor's degree in Computer Science, Management Information Systems, or other technical discipline, plus 5 years of experience, or equivalent.
- At least five years of experience as an HPC parallel-filesystem storage administrator, with experience with IBM Spectrum Scale (GPFS) or Lustre, or equivalent. Experience with optimizing for performance, reliability, and security.
- In-depth knowledge of HPC parallel filesystems and the ability to troubleshoot complex problems. Must be comfortable with monitoring and managing clustered filesystems, and be able to examine GPL driver code when required.
- In-depth knowledge of Linux NFS server/client implementation and ability to troubleshoot NFS issues.
- Experience with InfiniBand or OmniPath high-speed fabrics, including RDMA-based storage, subnet management, fabric topology and health monitoring, and parallel I/O over MPI.
- Proficiency with at least one shell-scripting language (bash, csh, tcsh), and at least one interpreted language (Perl, Python, Ruby).
- Good organization skills to balance and prioritize work.
- Good communication skills for working with teammates, customers, managers, and vendor support personnel.

Preferred Skills:

- Knowledge of SAN technologies (e.g., FC, FCoE, RoCE, NVMoF, iSER, SRP) and awareness of high-level protocol function, management approaches, and performance benchmarking and tuning.
- Knowledge of Ethernet networking (VLANs, etc.) as related to parallel filesystems.
- Experience with deploying parallel-filesystem upgrades in a rolling fashion with no overall system downtime.
- Experience with GPFS Cluster Export Services, Clustered NFS, GPFS Multi-cluster.
- Knowledge of distributed file systems and object stores such as Lustre, HDFS, BeeGFS, Ceph, Swift.
- Experience with applying patches and tuning kernel parameters as required to implement functionality, or address performance or security concerns.
- Familiarity with out-of-band management techniques (e.g., IPMI).
- Experience with revision control via Git.
- Experience deploying and managing large HPC clusters.
- Experience in submitting job scripts to a batch scheduler (ideally Slurm).
- Knowledge of MPI implementations (Intel MPI, MVAPICH2, OpenMPI, HPE/SGI MPT) and troubleshooting MPI application stability and performance problems.
- Working knowledge of programming languages such as Fortran, C, C++.

To learn more about RedLine please visit our website at