General information

Career area
Technology
Work Location(s)
500 Woodward Avenue, MI; 601 S. Tryon Street, NC
Remote?
No
Ref #
22455
Posted Date
06-24-26
Working time
Full time

Ally and Your Career

Ally Financial only succeeds when its people do - and that’s more than some cliché people put on job postings. We live this stuff! We see our people as, well, people - with interests, families, friends, dreams, and causes that are all important to them. Our focus is on the health and safety of our teammates as well as work-life balance and diversity and inclusion. From generous benefits to a variety of employee resource groups, we strive to build paths that encourage employees to stretch themselves professionally. We want to help you grow, develop, and learn new things. You’re constantly evolving, so shouldn’t your opportunities be, too?

Work Schedule: Ally designates roles as (1) fully on-site, (2) hybrid, or (3) fully remote. Hybrid roles are generally expected to be in the office a certain number of days per week as indicated by your manager. Your hiring manager will discuss this role's specific work requirements with you during the hiring process. All work requirements are subject to change at any time based on leader discretion and/or business need.

The Opportunity


At Ally, you get a startup feel, but experience the benefits of a company that has worked out the kinks and is fulfilling its purpose. We are always evolving and see that as a good thing. From owning our work to seeing its impact in the real world, our team is relentless in finding new ways technology can help make experiences better and help people. We are problem solvers, we value diverse thinking, we support one another, and we challenge ourselves to think bigger in the journey to deliver customer-obsessed tech solutions. To read more about what our tech team does, visit our tech blog at ally.tech.

You will bring SRE practice to the AI agent ecosystem — defining and enforcing production readiness standards, building the observability and alerting layer, running the readiness gate for every service before it goes live, and owning the incident response process when things fail. You will also build and operate the SRE agents themselves: the automated tools that run production readiness checks, generate post-incident reviews, monitor SLO burn rates, and surface reliability findings before a deployment proceeds.

This is not a traditional SRE role watching dashboards. You are an active builder. The SRE toolchain here is itself an agent-driven system — you will extend it, maintain its knowledge core content, and use it to enforce standards across every team in the program.

At this time, Ally will not sponsor a new applicant for employment authorization for this position.

The Work Itself

Production readiness and SLO ownership

  • Run the 10-point production readiness gate for every Lightspeed and Logos service before its first production deploy, including checks that SLOs are defined, a runbook exists, alerting is configured, rollback is documented, and on-call is assigned
  • Define and maintain Dynatrace SLOs for AI-powered services; configure burn-rate alerting (multi-window, aligned to user impact)
  • Own the error budget policy: track consumption, flag services approaching exhaustion, enforce the deployment freeze when budgets are gone
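For candidates less familiar with the terms above, the following is a minimal Python sketch of multi-window burn-rate alerting and an error-budget freeze check. It is an illustration only, not Ally's implementation or Dynatrace configuration: the `Window` type, the function names, and the 14.4 threshold (a value often quoted in general SRE guidance for a one-hour window against a 30-day budget) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Error ratio observed over one lookback window (hypothetical shape)."""
    name: str
    error_ratio: float  # failed requests / total requests, 0.0 to 1.0


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Speed at which the error budget is spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_w: Window, short_w: Window,
                slo_target: float, threshold: float) -> bool:
    """Page only when BOTH windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, which avoids paging on stale spikes.
    """
    return (
        burn_rate(long_w.error_ratio, slo_target) >= threshold
        and burn_rate(short_w.error_ratio, slo_target) >= threshold
    )


def budget_remaining(total_requests: int, failed_requests: int,
                     slo_target: float) -> float:
    """Fraction of the period's error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures <= 0:
        raise ValueError("no error budget available for this period")
    return 1.0 - failed_requests / allowed_failures


def deploys_frozen(remaining: float) -> bool:
    """Assumed policy for this sketch: freeze once the budget is fully spent."""
    return remaining <= 0.0
```

With a 99.9% SLO, for example, `should_page(Window("1h", 0.02), Window("5m", 0.03), 0.999, 14.4)` would page, while a calm five-minute window would suppress the alert even if the one-hour window still looks bad.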

Observability for AI workloads

  • Instrument AI agent pipelines with structured JSON logging (traceId, spanId, correlationId), custom metrics, and distributed traces
  • Build Dynatrace dashboards for AI services: request rate, error rate, latency P50/P95/P99, dependency health, agent invocation counts and failure rates
  • Identify and address the observability gaps that make AI system failures hard to diagnose — context truncation, tool call failures, model timeouts, partial completions
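As a rough illustration of the structured logging described above, here is a minimal sketch using only Python's standard `logging` and `json` modules. The field names traceId, spanId, and correlationId come from this posting; the `JsonFormatter` class, the `build_logger` helper, and the rest of the payload layout are assumptions and do not describe Ally's actual log schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    # Correlation fields named in the posting; missing ones are emitted as null.
    CONTEXT_FIELDS = ("traceId", "spanId", "correlationId")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("agent-pipeline")
    # `extra` attaches the correlation fields to the record as attributes.
    log.warning(
        "tool call failed: %s", "search",
        extra={"traceId": "t-1", "spanId": "s-7", "correlationId": "c-42"},
    )
```

Emitting the correlation fields on every line, even when they are null, keeps the schema stable so that queries for failures such as tool call errors or model timeouts do not need to special-case missing keys.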

SRE agent development and maintenance

  • Own the sre-gate, sre-monitor, sre-pir, sre-remediation, and sre-validation agents — keep their behavioral rules, domain context, and integration patterns current
  • Maintain the SLO definitions, domain team map, incident classification, and runbook location config in the knowledge core (domain.md, architecture.md)
  • Extend the SRE reliability anti-pattern checklist as new failure modes are identified in AI-powered services

Incident management and PIR

  • Lead post-incident reviews for production AI service failures; generate structured PIRs and track remediation items through to closure in Jira
  • Monitor for recurring patterns (same root cause three or more times triggers systemic pattern review); drive elimination of systemic issues
  • Maintain on-call rotation coverage across the service portfolio; run escalation path validation before each new service goes live
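The recurring-pattern rule above (same root cause three or more times) can be sketched in a few lines of Python. This is an illustrative assumption about how such a check might look, not a description of the sre-pir agent: the function name, the free-text root-cause labels, and the case-insensitive matching are all hypothetical.

```python
from collections import Counter
from typing import Iterable

# From the posting: three or more occurrences trigger a systemic pattern review.
SYSTEMIC_THRESHOLD = 3


def systemic_root_causes(root_causes: Iterable[str],
                         threshold: int = SYSTEMIC_THRESHOLD) -> list[str]:
    """Return root causes seen at least `threshold` times, sorted alphabetically.

    Labels are normalized (trimmed, lowercased) so that small differences
    in how a PIR author typed the cause do not hide a repeat.
    """
    counts = Counter(cause.strip().lower() for cause in root_causes)
    return sorted(cause for cause, n in counts.items() if n >= threshold)
```

In practice the labels would more likely come from a controlled incident classification than from free text, which would make the normalization step unnecessary.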

Reliability design review

  • Review PRs and architecture designs for AI services against the reliability anti-pattern checklist before merge: retries, circuit breakers, health endpoints, graceful degradation, rollback plans
  • Issue findings classified as BLOCKING or ADVISORY, each with specific file and line citations
  • Work with developers to resolve BLOCKING findings before they reach production

The Skills You Bring

Minimum Requirements:

  • 3+ years of experience
  • High School Diploma or GED equivalent

Preferred Requirements:

  • 4+ years in SRE, platform engineering, or production operations with direct ownership of production services
  • Deep familiarity with SLO/SLI/error budget concepts — not just the theory, but day-to-day operation in a real environment
  • Experience with an observability platform at the level of configuring alerts and dashboards, not just reading them (Dynatrace, Datadog, New Relic, or equivalent)
  • Strong Python or TypeScript — you write and own automation, not just runbooks

Strongly preferred

  • Experience instrumenting distributed systems for observability: structured logging, metrics, distributed tracing
  • Familiarity with the Anthropic Claude API or Claude Code — understanding what makes AI agent systems fail differently from conventional services
  • Experience running post-incident reviews and tracking systemic remediation to completion
  • Confluence and Jira integration experience — the SRE toolchain here publishes directly to both
  • Financial services or other highly regulated environment experience

What sets the best candidates apart

  • You have written the alerting rules and SLO definitions, not just responded to the pages they generate
  • You understand the specific failure modes of AI systems: non-determinism, context limits, tool call failures, latency spikes from model inference — and you know how to build observability that surfaces them
  • You treat reliability as a shared engineering discipline, not an ops function — you work with developers before code merges, not after pages fire
  • You have maintained runbooks that were actually used during incidents, which means they were accurate, concise, and kept current

How We'll Have Your Back

Ally's compensation program offers market-competitive base pay and pay-for-performance incentives (bonuses) based on achieving personal and company goals. Our Total Rewards program includes industry-leading compensation and benefits plus additional incentives that are designed to meet your needs and those of your family so you can get the most out of your career and your life, including:

  • Time Away: Program starts at 20 paid time off days in addition to 11 paid holidays and 8 hours of volunteer time off yearly (time off days are prorated based on start date and program varies based on full or part-time status and management level).
  • Planning for the Future: plan for the near and long term with an industry-leading 401(k) retirement savings plan with matching and company contributions, student loan pay downs and 529 educational save up assistance programs, tuition reimbursement, employee stock purchase plan, and financial learning center and financial coach access.
  • Supporting your Health & Well-being: flexible health and insurance options including medical, dental and vision, employee, spouse and child life insurance, short- and long-term disability, pre-tax Health Savings Account with employer contributions, Healthcare FSA, critical illness, accident & hospital indemnity insurance, and a total well-being program that helps you and your family stay on track physically, socially, emotionally, and financially.
  • Building a Family: adoption, surrogacy and fertility assistance as well as paid parental and caregiver leave, Dependent Day Care FSA, back-up child and adult/elder care days, and childcare discounts.
  • Work-Life Integration: other benefits including Mentally Fit Employee Assistance Program, subsidized and discounted Weight Watchers® program and other employee discount programs.
  • Other compensations: depending on the role for which you are considered, you may be eligible for travel allowances, relocation assistance, a signing bonus and/or equity.
  • To view more detailed information about Ally’s Total Rewards, please visit this link: https://www.ally.com/content/dam/pdf/corporate/ally-total-rewards-snapshot.pdf
 

Who We Are:

 

Ally Financial is a customer-centric, leading digital financial services company with passionate customer service and innovative financial solutions. We are relentlessly focused on "Doing it Right" and being a trusted financial-services provider to our consumer, commercial, and corporate customers. For more information, visit www.ally.com.

 

Ally is an equal opportunity employer committed to diversity and inclusion in the workplace. All qualified applicants will receive consideration for employment without regard to age, race, color, sex, religion, national origin, disability, sexual orientation, gender identity or expression, pregnancy status, marital status, military or veteran status, genetic disposition or any other reason protected by law.

 

We are committed to working with and providing reasonable accommodation to applicants with physical or mental disabilities. For accommodation requests, email us at hrpolicy@ally.com. Ally will not discriminate against any qualified individual who is capable of performing the essential functions of the job with or without reasonable accommodation.

Base Pay Range: $85,000 - $150,000 USD
An individual's position in the range is determined by the specific role, the scope and responsibilities of the role, work experience, education, certification(s), training, and additional qualifications. We review internal pay, the competitive market, and business environment prior to extending an offer. 
Incentive Compensation: This position is eligible to participate in our annual incentive plan.