
L4 Support Engineer – AI-Enabled Escalations – Infinia

Job ID: 2025-5089
Country: United States
City: Remote
Worker Type: Regular Full-Time Employee

Overview

This is an opportunity to join a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader that powers many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous vehicles, government, academia, research, and manufacturing.

"DDN's A3I solutions are transforming the landscape of AI infrastructure." – IDC

“The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI Storage in high performance environments.” – Marc Hamilton, VP, Solutions Architecture & Engineering | NVIDIA

DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence.

Our success is driven by a commitment to innovation, customer success, and market leadership, and by a team of passionate professionals who bring their expertise and dedication to every project. This is an exciting and rewarding role for a driven professional looking to make a lasting impact at a company that is shaping the future of AI and data management.

Job Description

As a Level 4 Support Engineer for DDN Infinia, you’ll be the final escalation point for the most complex and critical issues affecting enterprise and hyperscale environments. This hands-on role is ideal for a deep technical expert who thrives under pressure and has a passion for solving distributed system challenges at scale.

You’ll collaborate with Engineering, Product Management, and Field teams to drive root cause resolutions, define architectural best practices, and continuously improve product resiliency. Leveraging AI tools and automation, you’ll reduce time-to-resolution, streamline diagnostics, and elevate the support experience for strategic customers.

Key Responsibilities

Technical Expertise & Escalation Leadership

- Own critical customer case escalations end-to-end, including deep root cause analysis and mitigation strategies.
- Act as the highest technical escalation point for Infinia support incidents, especially in production-impacting scenarios.
- Lead war rooms, live incident bridges, and cross-functional response efforts with Engineering, QA, and Field teams.
- Utilize AI-powered debugging, log analysis, and system pattern recognition tools to accelerate resolution.

Product Knowledge & Value Creation

- Become a subject-matter expert on Infinia internals: metadata handling, storage fabric interfaces, performance tuning, AI integration, etc.
- Reproduce complex customer issues and propose product improvements or workarounds.
- Author and maintain detailed runbooks, performance tuning guides, and RCA documentation.
- Feed real-world support insights back into the development cycle to improve reliability and diagnostics.

Customer Engagement & Business Enablement

- Partner with Field CTOs, Solutions Architects, and Sales Engineers to ensure customer success.
- Translate technical issues into executive-ready summaries and business impact statements.
- Participate in post-mortems and executive briefings for strategic accounts.
- Drive adoption of observability, automation, and self-healing support mechanisms using AI/ML tools.

Required Qualifications

- 8+ years in enterprise storage, distributed systems, or cloud infrastructure support/engineering.
- Deep understanding of file systems (POSIX, NFS, S3), storage performance, and Linux kernel internals.
- Proven debugging skills at system/protocol/app levels (e.g., strace, tcpdump, perf).
- Hands-on experience with AI/ML data pipelines, container orchestration (Kubernetes), and GPU-based architectures.
- Exposure to RDMA, NVMe-oF, or high-performance networking stacks.
- Exceptional communication and executive reporting skills.
- Experience using AI tools (e.g., log pattern analysis, LLM-based summarization, automated RCA tooling) to accelerate diagnostics and reduce MTTR.

Preferred Qualifications

- Experience with DDN, VAST, Weka, or similar scale-out file systems.
- Strong scripting/coding ability in Python, Bash, or Go.
- Familiarity with observability platforms: Prometheus, Grafana, ELK, OpenTelemetry.
- Knowledge of replication, consistency models, and data integrity mechanisms.
- Exposure to Sovereign AI, LLM training environments, or autonomous system data architectures.

This position requires participation in an on-call rotation to provide after-hours support as needed.

Success Metrics – First 30 Days

Technical Ramp-Up

- Complete Infinia training, labs, and architecture deep dives.
- Stand up a fully functioning Infinia test system.
- Shadow at least 5 complex escalations and participate in 2 customer calls.

Operational Integration

- Lead one live incident response and deliver a full RCA within 48 hours.
- Propose 3+ enhancements to internal tools, AI/automation usage, or documentation.
- Establish key partnerships with Engineering and Field teams.

Strategic Insight

- Deliver a written 30-day reflection with gaps and high-impact recommendations.
- Begin identifying patterns where AI or automation can reduce MTTR or improve proactive detection.

Success Metrics – Beyond 30 Days

- MTTR on high-severity cases consistently below internal SLAs.
- Volume and quality of resolved L4 escalations.
- Strategic tooling or automation contributions adopted across the support org.
- Executive-ready RCAs that inform product improvement.
- High-impact engagements with strategic accounts (prevention, performance tuning, etc.).

DDN

Join our dynamic and driven team, where engineering excellence is at the heart of everything we do. We seek individuals who love to challenge themselves and are fueled by curiosity. Here, you'll have the opportunity to work across various areas of the company, thanks to our flat organizational structure that encourages hands-on involvement and direct contributions to our mission. Leadership is earned by those who take initiative and consistently deliver outstanding results, both in their work ethic and deliverables, making strong prioritization skills essential. Additionally, we value strong communication skills in all our engineers and researchers, as they are crucial for the success of our teams and the company as a whole.

Interview Process: After submitting your application, one of our recruiters will review your resume. If your application passes this stage, you will be invited to a 30-minute interview during which a member of our team will ask some basic questions. If you clear the interview, you will enter the main process, which can consist of up to four interviews in total:

- Coding assessment: Often in a language of your choice.
- Systems design: Translate high-level requirements into a scalable, fault-tolerant service (depending on role).
- Real-time problem-solving: Demonstrate practical skills in a live problem-solving session.
- Meet and greet with the wider team.

Our goal is to finish the main process in 2-3 weeks at most.

DataDirect Networks (DDN) is an Equal Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity, gender expression, transgender, sex stereotyping, sexual orientation, national origin, disability, protected Veteran Status, or any other characteristic protected by applicable federal, state, or local law.

