NVIDIA

Software Reliability Engineer - LPU Hardware DataFlow

Posted Yesterday
Remote
Hiring Remotely in UK
Senior level
The Software Reliability Engineer will focus on hardware reliability testing, develop automated test frameworks, and improve hardware and driver reliability through comprehensive testing and collaboration with teams.
The summary above was generated by AI

NVIDIA is known as the "AI Computing Company." Our GPUs power modern Deep Learning software frameworks, accelerated analytics, data centers, and autonomous vehicles. We seek a Software Reliability Engineer - LPU Hardware DataFlow to join our company and concentrate on hardware reliability testing and driver reliability. You will develop and conduct reliability and qualification campaigns for NVIDIA hardware (accelerators, boards). You will also build and sustain automated test frameworks for driver stability and regression. Additionally, you will lead efforts to improve hardware and driver reliability to meet customer expectations.

In this role, you own the reliability of our hardware and driver stack. You will run and automate hardware stress tests, longevity and environmental tests, and failure analysis; you will also take responsibility for driver reliability testing – stability under load, regression suites, compatibility matrices, and crash/hang triage. Your work ensures that hardware and drivers ship with confidence and that field issues are understood and prevented through improved test coverage and monitoring. The ideal candidate has solid software and automation abilities along with enthusiasm for hardware and low-level software. We seek engineers who consider failure modes and stress scenarios, develop consistent reliability testing processes, and connect driver and hardware behavior to identify root causes of reliability problems.

What you'll be doing:

  • Prevent logic bugs before they happen by providing formal correctness proofs.

  • Develop and sustain driver reliability test frameworks: automated stability evaluations, regression test suites, and compatibility assessments across OS, driver versions, and hardware SKUs.

  • Diagnose and identify driver and hardware failures: investigate crashes, freezes, and errors; collaborate with driver and hardware groups to resolve problems and enhance test coverage.

  • Establish and track reliability metrics and SLOs for hardware and drivers; perform post-mortems and encourage advancements in test automation and coverage.

  • Build, implement, and run hardware reliability and qualification tests: stress tests, longevity tests, thermal/power cycling, and environmental tests on GPUs and accelerators.

  • Automate test execution, result collection, and reporting; incorporate reliability tests into CI and release workflows; manage lab or farm infrastructure for reliability testing across EMEA and worldwide.

What we need to see:

  • BS or higher degree (or equivalent experience) with 8+ years in reliability engineering, hardware testing, driver testing, or SRE with a focus on hardware/drivers.

  • Functional programming experience (Haskell, Nix).

  • Strong systems-level programming experience (C++, Rust, Java).

  • Strong experience with Linux and scripting (Python, Shell) for test automation, result parsing, and tooling.

  • Proficiency in building automated test pipelines; experience with CI/CD and with running tests at scale (e.g. test farms, lab automation).

  • Ability to prioritize failures, examine logs and dumps, and collaborate with driver or hardware teams to identify root causes of issues.

  • Strong communication skills in English; ability to collaborate with distributed teams across EMEA and worldwide.

Ways to stand out from the crowd:

  • Experience with GPU or accelerator reliability testing; familiarity with NVIDIA or other GPU/driver ecosystems.

  • Experience with hardware durability or certification testing (stress, longevity, thermal, power) and/or driver consistency and regression testing.

  • Background in driver development, kernel debugging, or low-level software; ability to read driver code and correlate behavior with test failures.

  • Experience with hardware testing tools, lab automation, or DUT (device-under-test) management at scale.

  • Knowledge of reliability standards and methods (e.g. FIT rates, accelerated life testing, failure analysis).

  • Experience with firmware or BIOS reliability testing; understanding of hardware–software interaction and error reporting (e.g. AER, MCE).

Join our team of world-class engineers and be part of the groundbreaking work we do at NVIDIA. We are committed to encouraging a collaborative and inclusive environment, where every team member has the opportunity to thrive and make a significant impact!

Top Skills

C++
CI/CD
Haskell
Java
Linux
Nix
Python
Rust
Shell
