Please note: This schedule is for OpenStack Active Technical Contributors participating in the Icehouse Design Summit sessions in Hong Kong. These are working sessions to determine the roadmap of the Icehouse release and make decisions across the project. To see the full OpenStack Summit schedule, including presentations, panels and workshops, go to http://openstacksummitnovember2013.sched.org.
The Ironic service must be able to tolerate the failure of individual components. Large deployments will need redundant API and Conductor instances, and a deployment fabric with no single point of failure (SPoF). Ironic currently uses the database to coordinate resource locks between multiple Conductors, but any given deployment is still managed by only one Conductor. There are several things we need to do to improve Ironic's fault tolerance.
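As a starting point for discussion, here is a minimal sketch of database-coordinated locking with a timeout, so that a dead Conductor's lock can eventually be broken. The schema, column names, and `LOCK_TIMEOUT` value are illustrative assumptions, not Ironic's actual `task_manager` implementation; the atomic UPDATE acts as a compare-and-swap.

```python
import sqlite3
import time

# Hypothetical schema: Ironic's real nodes table and TaskManager API differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (uuid TEXT PRIMARY KEY, reservation TEXT, updated_at REAL)")
conn.execute("INSERT INTO nodes VALUES ('node-1', NULL, ?)", (time.time(),))

LOCK_TIMEOUT = 60.0  # seconds; illustrative value


def acquire(conductor, uuid):
    """Atomically reserve a node if it is free, or if its lock is stale.

    The single UPDATE is the compare-and-swap: it only matches when the
    reservation is empty or older than LOCK_TIMEOUT, so exactly one
    conductor can win.
    """
    cutoff = time.time() - LOCK_TIMEOUT
    cur = conn.execute(
        "UPDATE nodes SET reservation = ?, updated_at = ? "
        "WHERE uuid = ? AND (reservation IS NULL OR updated_at < ?)",
        (conductor, time.time(), uuid, cutoff))
    return cur.rowcount == 1  # True only if this conductor won the lock


def release(conductor, uuid):
    """Release the lock, but only if we actually hold it."""
    conn.execute(
        "UPDATE nodes SET reservation = NULL "
        "WHERE uuid = ? AND reservation = ?", (uuid, conductor))


print(acquire("cond-a", "node-1"))  # True: cond-a wins the lock
print(acquire("cond-b", "node-1"))  # False: the lock is still fresh
```

If the lock-holder dies without releasing, the `updated_at < cutoff` clause lets another Conductor break the stale lock once the timeout elapses, which covers the timeout and lock-breaking items below.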
Let's get together and plan how we can:

* recover the PXE / TFTP environment for a managed node when the Conductor that deployed it goes away;
* set reasonable timeouts on task_manager mutexes;
* break a task_manager's mutex if the lock-holder is unresponsive or dies;
* distribute deployment workload intelligently among many Conductors;
* route RPC requests to the Conductor that has already locked a node;
* route RPC requests appropriately when different Conductor instances use different drivers.
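One candidate for distributing deployment workload among many Conductors is a consistent hash ring: each node maps to a Conductor, and losing a Conductor remaps only that Conductor's nodes. This toy sketch is one option for the discussion, not a decided design; the `HashRing` class and `REPLICAS` constant are hypothetical names.

```python
import bisect
import hashlib

REPLICAS = 64  # virtual points per conductor; more points smooth the distribution


def _hash(key):
    """Hash a string key onto the ring's integer space."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing(object):
    def __init__(self, conductors):
        # Each conductor contributes REPLICAS points on the ring.
        self._ring = sorted(
            (_hash("%s-%d" % (c, i)), c)
            for c in conductors for i in range(REPLICAS))
        self._keys = [h for h, _ in self._ring]

    def conductor_for(self, node_uuid):
        """Walk clockwise from the node's hash to the next conductor point."""
        idx = bisect.bisect(self._keys, _hash(node_uuid)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["cond-a", "cond-b", "cond-c"])
# If cond-b dies, only the nodes it owned are remapped:
smaller = HashRing(["cond-a", "cond-c"])
moved = sum(1 for n in range(100)
            if ring.conductor_for("node-%d" % n)
            != smaller.conductor_for("node-%d" % n))
print("nodes remapped after losing cond-b:", moved)
```

The same mapping could also answer the RPC-routing questions above: an API instance can compute `conductor_for(node_uuid)` locally and address the RPC topic of the owning Conductor, rather than broadcasting.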