Several months ago Bill Liu reached out to me about an opportunity to speak at the Cloud+Data Next conference that was held on Sunday. After some discussion, we settled on me giving a talk on “Day 2 Operations Of Cloud-Native Systems” in the DevOps/SRE track, but there were many opportunities here based on the work we do. Tracks at this event covered containers, machine learning, data infrastructure and analytics and more, all of which we put strategic effort into supporting on the DC/OS platform.
The conference was held at the Santa Clara Convention Center. It may come as a surprise, but in spite of working in tech, attending a lot of conferences AND having lived here for over seven years, it was my first time at that convention center. I’ve worked from home for so long that I shy away from long commutes, and for me it’s generally easier to attend a conference somewhere like Seattle, where I have a hotel, than to commute down into the valley. Sunday made this an easier prospect, since I wouldn’t have to deal with rush hour traffic. A stroke of luck also meant that MJ was traveling for work this weekend, so I didn’t need to take the train (a 2+ hour ordeal) or leave him without a car as I took the only one we have out here for the day.
The conference format consisted of a series of keynotes in the morning, with tracks in the afternoon. The first keynote of the day was from Craig McLuckie, the CEO of Heptio and one of the founders of the Kubernetes project. His talk was titled “Architecting for sustainability: micro-services and the path to cloud native operations” and he began by talking about the balance between fast development and innovation and having a perfectly stable system. His talk also covered the shift to more specialized operations roles, with staff covering infrastructure, cluster, common services and specific application operations. He argued that with specialization comes more automation, as you end up with a shift from generalists who do various tasks rarely, to specialists who will quickly tire of doing the same tasks over and over again, and thus automate them. It also leads to better monitoring, among other things, since you have people who really understand the systems that they’re working with.
I also enjoyed that he talked about the importance of Mean Time To Recovery (MTTR), rather than attempting to build a system that never fails. Systems will fail, people will make mistakes. It’s our job to make sure that when things fail and people make mistakes, we can recover from them and move forward. Containers were a key part of his message here, with immutable images being created in development, tested and deployed. When your development and testing environments look similar to production, a lot of problems go away.
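To make the MTTR idea concrete, here is a minimal sketch of how the number could be computed from a list of incidents. The incident timestamps and the `mean_time_to_recovery` helper are my own hypothetical illustration, not something shown in the keynote; it simply averages the time between detection and recovery.

```python
from datetime import datetime, timedelta

# Hypothetical incident records as (detected_at, recovered_at) pairs.
# These values are made up for illustration only.
incidents = [
    (datetime(2017, 6, 1, 9, 0), datetime(2017, 6, 1, 9, 30)),
    (datetime(2017, 6, 8, 14, 0), datetime(2017, 6, 8, 14, 10)),
    (datetime(2017, 6, 20, 22, 15), datetime(2017, 6, 20, 23, 5)),
]


def mean_time_to_recovery(records):
    """Return the average of (recovered - detected) across all incidents."""
    if not records:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the individual outage durations, starting from a zero timedelta.
    total = sum((end - start for start, end in records), timedelta())
    return total / len(records)


mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr}")
```

Tracking this number over time, rather than only counting failures, is one way to see whether recovery is actually getting faster.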
Since my own talk later in the day was about what to look for when you finally operate that cloud-native system driven by microservices, it was really nice to have this talk ahead of mine to give the audience such a great starting point. “Remember that keynote this morning? Here’s what you need to consider when you’re finally running it: metrics, logging, debugging, recovery and backup tools…”
The next keynote was from Jeff Feng on “How Airbnb Does Data Science” where he talked about the value of experiments in the development of their product. Using examples from changes that Airbnb had made to the website UI, he walked us through how data was used when doing UI change experiments. How will a change in wording on a button impact bookings? What types of customers respond better to what types of photos in advertising their offerings? I learned that in order to most effectively use and learn from these experiments, they have a data scientist embedded on every team who can help them isolate, review and analyze data from experiments to make sure changes don’t have negative or unintended consequences. He concluded his talk by talking about their internal Data University where employees can learn how data can impact their own work and be introduced to the tooling to make that happen. During his talk he mentioned a TechCrunch article about it, which I dug up afterwards, “Airbnb is running its own internal university to teach data science”, as well as a Medium.com post that he co-authored which dives into it a bit more deeply, “How Airbnb Democratizes Data Science With Data University”.
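For a rough sense of what analyzing a button-wording experiment can look like, here is a small sketch of a two-proportion z-test. This is a generic textbook technique and not a description of Airbnb's actual tooling; the function name and the visitor and booking counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variant A and variant B.

    Returns (z, p_value) for a two-sided test using a pooled
    standard error and the normal approximation.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 4000 visitors per variant, 200 vs 250 bookings.
z, p = two_proportion_z_test(200, 4000, 250, 4000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value only says the difference is unlikely to be noise; checking for the unintended consequences he mentioned still takes someone looking at the other metrics a change might affect.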
My talk took place after one from Todd Palino who talked about “Holistic Reliability: SRE at LinkedIn” which covered their approach to SRE. I’m very familiar with the role of SRE, so the general concepts around it weren’t new to me. However, it was nice to learn about their practices around common operations repositories so others can pitch in and “help themselves” much like I was familiar with in the OpenStack project. It was also nice to learn how their company mindset was open source first. I’ve known for some time that Apache Kafka (which he happens to work on) came out of LinkedIn, but I learned that this was by no means an anomaly. He talked about how there is often a push for open sourcing when a valuable new tool has been developed internally.
The talk I gave on “Day 2 Operations Of Cloud-Native Systems” (slides here) was an evolution of a talk I gave at Seattle DevOps Days back in April. The talk now includes a more formalized checklist of things you need to build into the evaluation and construction of your cloud-native system. After talking with folks in Seattle, one of the most important takeaways that I stressed this time was that the things I talked about (monitoring, logging, backups and debugging) are the ones that are taken for granted, and it’s rare that enough time is spent making sure they are of high enough quality to serve the operations team when an outage eventually occurs. My hope is that the checklist I added will help give some direction around these “unwritten” necessities as teams shift into a world that, even in its simplest form, has load-balanced applications, running on containers, on clusters, on some underlying hardware or cloud. That’s a lot of layers to keep track of when something goes wrong.
After my talk I attended one from Nell Shamrell-Harrington of Chef. I happened to run into her earlier in the day and realized at lunch that she was also a speaker at Seattle DevOps Days this year! There she was talking about DevOps in Politics, but at this conference she was giving a talk on “Platform Agnostic and Self Organizing Software Packages” where she talked about the project she’s working on at Chef, Habitat. The promise of Habitat is the ability to build and deploy applications across many platforms consistently. It’s interesting to have a look at so I suggest you do, but what impressed me the most about her presentation was that her demos actually walked the walk. Her first demo showed configuration and installation of an app on Ubuntu on Azure, and the second an app on Red Hat on AWS. This may seem like a small detail, but platform agnosticism is hard to do right, and when a project is as young as Habitat is you often see only a single type of demo that has been fine-tuned to work perfectly in one place. Kudos to the team for casting a wider net to impress us early on!
In all, I enjoyed the event. The ticket price was lower than some others covering these topics, so it felt like there was an opportunity for an audience with a more diverse range of skills to attend. I met a student who was looking into getting into data analytics, and a systems administrator who was seeking to move more of his skill set, and eventually his work, into the area of data. Several other people I spoke with were working in software development, but were curious about where the latest data-driven technology was going and were willing to give up a Sunday to learn more. I may keep an eye out for future events in the area since it may be a nice opportunity.
More photos from the event can be found here: https://www.flickr.com/photos/pleia2/albums/72157683129676822