Course Outline
EXO Infrastructure as Code
- Overview of EXO deployment patterns: single-node, multi-node, and RDMA clusters
- Automating dependency installation (Xcode, uv, Node.js, Rust) with configuration management
- Using Nix flakes for reproducible EXO builds and developer environments
- Writing Ansible playbooks or shell scripts for unattended cluster provisioning
Reproducible Builds and CI Integration
- Pinning dependencies and building the dashboard in CI pipelines
- Running EXO smoke tests in GitHub Actions or GitLab CI runners
- Creating golden images and snapshot-based rollback workflows for macOS and Linux VMs
- Versioning custom model cards alongside application code
Cluster Discovery and Networking Automation
- Configuring mDNS and static DNS for reliable libp2p node discovery
- Automating network profile creation and Thunderbolt bridge management on macOS
- Using custom namespaces (EXO_LIBP2P_NAMESPACE) to separate dev, staging, and prod clusters
- Firewall rules and network segmentation for multi-tenant environments
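As a minimal sketch of the namespace separation listed above, the following Python snippet builds a per-environment process environment that sets EXO_LIBP2P_NAMESPACE. Only the variable name comes from this outline; the `exo-<environment>` naming scheme and the helper name `env_for` are illustrative assumptions, not EXO conventions.

```python
import os

# Hypothetical environment names; adjust to match your own fleet.
ENVIRONMENTS = ("dev", "staging", "prod")


def env_for(environment, base=None):
    """Return a copy of the process environment with a per-environment namespace.

    The "exo-<environment>" value format is an assumption made for illustration.
    """
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    env = dict(os.environ if base is None else base)
    env["EXO_LIBP2P_NAMESPACE"] = f"exo-{environment}"
    return env


if __name__ == "__main__":
    for name in ENVIRONMENTS:
        print(name, env_for(name, base={})["EXO_LIBP2P_NAMESPACE"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a node, so that nodes started for different environments do not share a namespace.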
Storage and Model Lifecycle Management
- Designing EXO_MODELS_DIRS and EXO_MODELS_READ_ONLY_DIRS strategies
- Mounting NFS or SAN shares as read-only model repositories for fast provisioning
- Garbage collection of stale caches and versioned weight retention policies
- Automating model pre-downloads and health checks before rolling updates
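The cache garbage collection item above could be approached along these lines. This is a sketch under stated assumptions: it treats any file not modified within a given number of days as stale, which is only one possible retention policy, and the function names are hypothetical. It does not rely on any EXO API and defaults to a dry run.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_stale_files(root, max_age_days, now=None):
    """List files under root whose modification time is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def collect_garbage(root, max_age_days, dry_run=True, now=None):
    """Return the stale files and delete them unless dry_run is True."""
    stale = find_stale_files(root, max_age_days, now=now)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Running with `dry_run=True` first and reviewing the returned list is a reasonable safeguard before deleting cached weights that may be expensive to download again.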
Monitoring and Alerting
- Shipping EXO logs to centralized logging (ELK, Loki, or Splunk)
- Building Grafana dashboards from EXO_TRACING_ENABLED output
- Alerting on cluster membership changes, OOM events, and inference latency spikes
- Correlating macmon hardware telemetry with model performance regressions
Update, Rollback, and Disaster Recovery
- Staging EXO binary updates in a canary node before fleet-wide rollout
- Model-level rollback: switching between quantized versions without re-downloading
- Backing up and restoring cluster state, custom namespaces, and cached weights
- Documenting recovery runbooks for total cluster rebuild scenarios
Security Hardening and Compliance
- Applying TLS at the reverse proxy layer (nginx, Traefik) for the dashboard and API
- Implementing API rate limiting and IP whitelisting for EXO endpoints
- Isolating clusters with VLANs and zero-trust network policies
- Auditing access and maintaining an inventory of deployed models and versions
Requirements
- Experience with DevOps practices (CI/CD, IaC, container orchestration)
- Familiarity with macOS or Linux system administration and package management
- Understanding of networking, DNS, and storage concepts
Audience
- DevOps engineers
- Infrastructure architects
- SREs responsible for on-premises AI workloads
21 Hours
Testimonials (2)
The consultant's knowledge and experience: theoretical topics were addressed by applying them to real processes. The course offers a highly valuable program in information technology management.
Luis Castro Gamboa - Cooperativa De Ahorro Y Credito Ande No. 1 R.L.
Course - Site Reliability Engineering (SRE) Foundation®
Each specification was explained very clearly.
Ricardo Ramirez - AMX CONTENIDO
Course - DevOps Leader (DOL)®