Principal Architect – Enterprise AI / ML

Job Locations: US-Remote
Job ID: 2025-5177
Country: United States
City: Remote
Worker Type: Regular Full-Time Employee

Overview

This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers across industries ranging from life sciences and healthcare to financial services, autonomous vehicles, government, academia, research, and manufacturing.

  

"DDN's A3I solutions are transforming the landscape of AI infrastructure." – IDC 

 

“The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI storage in high-performance environments.” – Marc Hamilton, VP, Solutions Architecture & Engineering | NVIDIA 

  

DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence. 

  

Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This is a chance to make a significant impact at a company that is shaping the future of AI and data management. 

  

Our commitment to innovation, customer success, and market leadership makes this an exciting and rewarding role for a driven professional looking to make a lasting impact in the world of AI and data storage. 

Job Description

We are seeking a Principal Architect – Enterprise AI/ML to design and develop AI-driven solutions leveraging large-scale data architectures for high-performance training and inference workloads. This role focuses on integrating AI/ML technologies with Infinia (a large-scale distributed data platform), data lakes, and high-throughput data pipelines to support enterprise-scale AI applications. The ideal candidate will have expertise in AI infrastructure, large-scale storage optimization, distributed training, and cloud-native AI architectures.

 

Key Responsibilities:

  • Architect and optimize AI/ML solutions that efficiently manage and process data across large-scale data platforms for training and inference.
  • Design high-throughput data pipelines that integrate with the Infinia intelligent data platform to support LLMs, deep learning, and other AI workloads.
  • Develop scalable, low-latency data solutions to enhance model training efficiency and support real-time AI inference applications.
  • Collaborate with engineers and cloud architects to implement AI-driven data management, caching, and prefetching strategies for high-performance ML workloads.
  • Optimize AI data access patterns, including sharded datasets, vector databases, and parallel I/O techniques, to reduce bottlenecks in AI training pipelines.
  • Implement ML model checkpointing, distributed training strategies (FSDP, DeepSpeed, Megatron-LM), and data-efficient model deployment.
  • Work with GPUs, TPUs, and AI accelerators to ensure seamless integration of AI training/inference workloads with distributed data infrastructure.
  • Drive data governance, AI security, and compliance for large-scale AI/ML data pipelines.
  • Stay ahead of advancements in AI-driven data solutions, next-generation file systems, and distributed AI workloads.

Required Qualifications:

  • Master’s or Ph.D. in Computer Science, AI/ML, Data Engineering, or a related field (or equivalent experience).
  • 15+ years of experience in AI/ML infrastructure, large-scale data systems, and distributed computing.
  • Deep expertise in distributed storage systems (e.g., Lustre, Ceph, GPFS, AWS S3, HDFS) and high-throughput AI data pipelines.
  • Experience with AI/ML frameworks (e.g., PyTorch, TensorFlow, JAX) and distributed training techniques (e.g., FSDP, DeepSpeed, Horovod).
  • Strong understanding of high-performance computing (HPC), NVMe-based caching, and parallel I/O optimizations for AI workloads.
  • Proficiency in Python, C++, and APIs for AI data processing.
  • Experience deploying AI solutions in cloud and hybrid environments (AWS/GCP/Azure) using Kubernetes, Ray, and containerized AI workflows.
  • Knowledge of vector databases (FAISS, Milvus, Weaviate) and AI-optimized data platforms for embedding retrieval.
  • Strong technical leadership with a proven track record of architecting AI/ML solutions integrated with enterprise-scale data systems.

Preferred Qualifications:

  • Experience optimizing LLM training/inference on large-scale data infrastructure.
  • Familiarity with storage-aware AI model compression, distillation, and checkpointing strategies.
  • Solid experience with open-source big data and AI frameworks.
  • Understanding of federated learning, data residency laws, and AI governance for enterprise data.

This position requires participation in an on-call rotation to provide after-hours support as needed.

 

If you’re passionate about scaling AI/ML solutions with high-performance data architectures and working at the intersection of AI, large-scale data systems, and high-performance computing, we’d love to hear from you!

DDN

Join our dynamic and driven team, where engineering excellence is at the heart of everything we do. We seek individuals who love to challenge themselves and are fueled by curiosity. Here, you'll have the opportunity to work across various areas of the company, thanks to our flat organizational structure that encourages hands-on involvement and direct contributions to our mission. Leadership is earned by those who take initiative and consistently deliver outstanding results, both in their work ethic and in their deliverables, which makes strong prioritization skills essential. We also value strong communication skills in all our engineers and researchers, as they are crucial to the success of our teams and the company as a whole.

 

Interview Process: After submitting your application, one of our recruiters will review your resume. If your application passes this stage, you will be invited to a 30-minute interview during which a member of our team will ask some basic questions. If you clear the interview, you will enter the main process, which can consist of up to four interviews in total:

 

  • Coding assessment: Often in a language of your choice.
  • Systems design: Translate high-level requirements into a scalable, fault-tolerant service (depending on role).
  • Real-time problem-solving: Demonstrate practical skills in a live problem-solving session.
  • Meet and greet with the wider team.

Our goal is to finish the main process in 2-3 weeks at most.

 

DataDirect Networks (DDN) is an Equal Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity, gender expression, transgender, sex stereotyping, sexual orientation, national origin, disability, protected Veteran Status, or any other characteristic protected by applicable federal, state, or local law.

 

#LI-Remote
