Scaling design: What does scaling mean at the TripleO level; where do we do well, where are we lacking, and what do we need to design? For both the deployment activity itself as well as the deployed cloud... (Session proposed by Robert Collins)

* What's manual in a TripleO deployment?
  * Adding nodes to the Heat template
  * Tailing log files - identifying progress / status (we automate some of this, but it is clunky)
  * Recovering from failures (related bug: https://bugs.launchpad.net/heat/+bug/1160052)
    * Also see https://etherpad.openstack.org/p/icehouse-summit-heat-convergence
  * Registering of nodes
  * Manual driving of the API
* What's automated but slow?
  * Initial keystone configuration
  * Bootstrapping
    * Maybe (for bootstrap images?) build a blank nova etc. db during image build time
  * 60 seconds of latency in nova before registered nodes are seen by the scheduler
    * Register with the scheduler from the API server?
  * All services report just one waitcondition; Tuskar would like to start configuring them as each becomes usable? Perhaps a monitoring check for 'keystone is usable'?
  * The baremetal deploy helper throttles to 1 deploy at a time
    * There may be race conditions
    * Crowbar says 'max of 10 at a time'
    * Isolated network (avoid production traffic impact)
    * Need knobs? And a steampipe.
* Where will we hit scaling / perf limits?
  * Network
  * Disk IO
    * Page cache on glance / nova compute [ironic]
  * Database issues
  * We need tools for measuring
    * Systematic measuring needed:
      * Latency to deploy
      * Latency to build
      * Where the time goes
    * Maybe put collectd + graphite in the undercloud
    * Maybe put logstash + elasticsearch in the undercloud
    * Capture data from:
      * APIs
      * Deploy ramdisk
      * Perhaps deployed instances -> logstash etc.?

ACTIONS:
* Ensure Ironic does not suffer from the node registration latency
  * Fix nova-bm too if possible - should be basically the same fix
  * Starting to collect details here: https://bugs.launchpad.net/nova/+bug/1248022
* Test and tune baremetal disk image copying
  * Gather data; crucially, aim to minimise the average per-node time to deploy, not the average time to deploy all nodes - the former gives lower latency for post-deploy work to start (see the measurement sketch after this list)
* Experiment with collectd + graphite
  * Tiny ramdisk version of a collectd client using /dev/udp? (see the metric-pushing sketch after this list)
  * Capture progress of db init etc. into collectd?
* Add logstash and/or elasticsearch to the undercloud for instrumentation
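
A possible measurement sketch for the per-node deploy latency action above - illustrative Python only, not agreed tooling. It assumes a Nova v2 compute endpoint URL and a keystone token obtained elsewhere (both are placeholders here), polls /servers/detail, and reports the average per-node time to reach ACTIVE alongside the total wall-clock time.

    # Sketch: measure per-node deploy latency by polling the Nova API until
    # every server goes ACTIVE, then report average per-node time rather than
    # only the time for the whole batch. Endpoint URL and token are assumed
    # to come from elsewhere (e.g. keystone); no ERROR-state handling here.
    import time
    import requests


    def wait_for_active(nova_endpoint, token, poll_interval=10):
        started = time.time()
        first_seen = {}   # server id -> time we first saw it
        activated = {}    # server id -> time it reached ACTIVE

        while True:
            resp = requests.get(
                "%s/servers/detail" % nova_endpoint,
                headers={"X-Auth-Token": token})
            resp.raise_for_status()
            servers = resp.json()["servers"]

            now = time.time()
            for server in servers:
                first_seen.setdefault(server["id"], now)
                if server["status"] == "ACTIVE":
                    activated.setdefault(server["id"], now)

            if servers and len(activated) == len(servers):
                break
            time.sleep(poll_interval)

        per_node = [activated[i] - first_seen[i] for i in activated]
        print("total wall clock: %.1fs" % (time.time() - started))
        print("average per-node: %.1fs" % (sum(per_node) / len(per_node)))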
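
A possible metric-pushing sketch for the collectd + graphite experiment - again illustrative Python, not agreed tooling. It assumes the undercloud runs a carbon (graphite) plaintext listener with UDP enabled on port 2003; the host and metric names are placeholders. A tiny ramdisk could do the equivalent with a shell redirect to /dev/udp.

    # Sketch: push a single graphite plaintext metric ("name value timestamp")
    # over UDP, fire-and-forget. Assumes carbon's UDP listener is enabled on
    # port 2003 in the undercloud; host and metric names are illustrative.
    import socket
    import time


    def send_metric(name, value, host="undercloud.example.com", port=2003):
        line = "%s %s %d\n" % (name, value, int(time.time()))
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.sendto(line.encode("ascii"), (host, port))
        finally:
            sock.close()


    if __name__ == "__main__":
        # e.g. report that db init finished after 42.5 seconds on this node
        send_metric("tripleo.deploy_ramdisk.db_init.duration", 42.5)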