Senior/Staff AI Cloud Infra Engineer
Comfy Org
Software Engineering, Data Science
Posted on Oct 29, 2025
Senior/Staff AI Cloud Infra Engineer (GPU Compute)
The Role
We're looking for a Cloud Infrastructure Engineer who thrives on building and scaling large GPU compute platforms quickly. You'll be instrumental in developing and managing the foundational infrastructure that powers our AI workloads. Our core infrastructure relies heavily on Python, Kubernetes (K8s), Terraform, and Ansible, but we care more about your ability to learn, adapt, and ship robust solutions than whether you've used these exact tools before.
You are a good fit if this describes you
You excel at building and managing distributed compute platforms, especially those involving GPU resources.
You have deep expertise in backend systems that orchestrate complex workloads efficiently, managing capacity and resource constraints.
You possess a strong understanding of foundational cloud infrastructure (AWS/GCP/Azure) and Linux provisioning/management tools.
You know how to design for reliability and scale with minimal operational overhead.
You learn new technologies rapidly because you're excited by solving hard infrastructure challenges.
You've scaled infrastructure before and understand the tradeoffs that matter.
You think most infrastructure moves too slowly and could be way better automated and optimized.
You're comfortable diving into unfamiliar systems and making them work reliably.
You are a self-starter who executes quickly, takes ownership, and constantly seeks improvement.
What you'll do
Develop and maintain our core Python platform for routing requests, orchestrating AI workloads, managing GPU server capacity, providing observability, and more.
Develop and maintain our infrastructure layer using Terraform, Ansible, and cloud provider APIs to manage our fleet of GPU workers across cloud and potentially bare metal environments.
Own and operate the technologies underpinning our platform, potentially including K8s, FluxCD, Nomad, Prometheus, Thanos, Grafana, Loki, distributed networking/storage, etc.
Architect and implement solutions that directly impact the performance and availability of services for millions of ComfyUI users.
Work closely with our core engineering team to design and build new infrastructure systems.
Help create the vision and lay the foundation for where our infrastructure should go over the next one, two, and five years.
Help shape our technical direction and infrastructure best practices as we grow.
Requirements
Deep experience building and managing distributed compute platforms, preferably using Python.
Strong foundation in managing cloud infrastructure (AWS, GCP, or Azure). Experience with bare metal is a plus.
Solid understanding of container orchestration (Kubernetes preferred) and CI/CD principles and tools.
Excellent communication skills.
Proven ability to learn fast and ship quality infrastructure code and configurations.
Nice to have
You have excelled at a fast-paced, high-growth tech startup before or are extremely excited about being in one.
Experience specifically with GPU management, scheduling, and monitoring in a large-scale environment.
Experience with specific observability tools (Prometheus, Grafana, Loki, Thanos).
Application Details
This is a remote position based in the United States.
What is ComfyUI?
ComfyUI is the world’s leading visual AI platform: an open, modular system where anyone can build, customize, and automate AI workflows with precision and full control. Unlike most AI tools that hide their inner workings behind a simple prompt box, ComfyUI gives professionals the freedom to design their own pipelines, connecting models, tools, and logic visually like building blocks. It’s used by artists, filmmakers, video game creators, designers, researchers, VFX houses, and teams at OpenAI, Netflix, Amazon Studios, Ubisoft, EA, and Tencent, among others, all of whom want to go beyond presets and truly shape how AI creates. ComfyUI empowers those who never trained with a brush to become painters, and those who did to become maestros.
Built for users who value transparency and control
Infinitely extensible: thousands of community-made nodes and integrations
Scales from creative experimentation to production automation
Open-source, used by millions, and backed by one of the most active AI communities online
Evolving to democratize visual AI creation: empowering everyone from hobbyists to studios, storytellers, and enterprises to be more productive and creative than ever before
ComfyUI isn’t just another AI app. It’s aiming to become the operating system for visual generative AI, the foundation on which the next generation of creative tools is being built.
A creative’s showcase of how Comfy is used in their work
About Us
We are a small, intense, and well-funded team in San Francisco that pushes ComfyUI and its ecosystem forward. Our team members come from Stability AI and Google, and many contributed to the ComfyUI ecosystem long before working here.
Our team is small and flat. There is no hierarchy, only areas of responsibility: devs, ops, product, etc.
The only things that matter are the quality of your cultural fit and your execution. We work hard and demand a lot of each other. But we have fun: everyone is here to make something meaningful that will end up being our life’s work. If this mission excites you and you view yourself as top-tier talent, your future latent self is waiting for you at Comfy.
Q&A
How can I increase my chances of getting the job?
What is the team culture?
What kind of background are you looking for?
What does the hiring process look like?
In-person vs remote?
What if I need visa sponsorship to work in the US?
Can I get feedback for my resume and interview?