I recently attended my first MesosCon, the North American edition hosted in Los Angeles. There are three such events per year, the other two held in Asia and Europe. These events bring together various companies and other organizations working with and on Apache Mesos so I was really eager to finally meet some of the folks I’ve only interacted with online.
The event began on Wednesday with a Mesos community Hackathon and a customer-focused DC/OS Day run by Mesosphere. Given my role and background, I joined the Hackathon where my colleague Jörg Schad kicked off the event by encouraging attendees to collect ideas for projects they wanted to tackle throughout the day. Topics ranged from documentation improvements, creation of a Kubernetes-related demo for SMACK, work on frameworks and further work on an autoscaler for several cloud platforms where DC/OS runs. Plus, there were mini cupcakes!
The main event for MesosCon began Thursday morning with an introductory keynote from Ben Hindman, Co-Creator of Apache Mesos (and my boss), where he covered some of the latest features in Mesos over the past year that he’s most proud of. These included the introduction of nested container support, adherence to standards through the Open Container Initiative project and beyond, and general expansion in usage and community. Mesosphere Co-Founder Tobi Knaup also got on stage, joined by Tim Hockin of Google for a demonstration of the new beta availability of Kubernetes on DC/OS. One of the really distinctive things about this implementation is that it’s running pure upstream Kubernetes, among other things, this allows for migration of a pure Kubernetes environment (on Mesos or not) as well as known functionality with the tooling written today for pure upstream Kubernetes.
Ben Hindman sharing new and noteworthy improvements to Mesos
These talks were broken up by a pair of panels, the first talking about Mesos in the Enterprise with Michael Aguiling at JPMC, Larry Rau (Verizon), Cathy Daw (Mesosphere), Stefan Bauer and Hubert Fisher (Audi), and Josh Bernstein from {code} as a moderator. At the enterprise level, a lot of interest was around distribution of resources across cloud vendors and on-prem, the introduction of open source software in companies unaccustomed to it, general scaling needs required by large companies and a standards-driven future so they can continue to integrate container technologies into their infrastructure.
The final keynote panel of the morning had Ben Hindman sitting down with Neha Narkhede (co-founder and CTO of Confluent) and Jonathan Ellis (co-founder and CTO of Datastax) for a fascinating discussion about the role of Apache Cassandra and Apache Kafka in the fast-data driven world powered increasingly by the SMACK (Spark, Mesos, Akka, Cassandra and Kafka) stack. The main message in the panel was how fast data is essential in the interaction with customers today, and their Internet of Things gadgets that create an expectation of immediate response. This panel was also where I heard a theme that would be repeated as the conference progressed: people move to microservices incrementally. There are a few ways of moving an environment which is microservices driven, but whether they convert old systems slowly or adopt a policy that all new projects must be built with them, it’s incredibly common across the industry. As a developer advocate these days, it was also nice to hear from Neha that the open source nature of the components are empowering developers, which makes my role in getting out there to talk to developers and operations folks who are on the ground with the technology particularly important.
The keynotes concluded and we dispersed into three tracks, on SMACK, Ops, and DevOps, and a new addition to the conference called MesosCon University where attendees could attend an 80 minute hands-on session on making applications production-ready, securing Mesos clusters, or building stateful services.
I was the track lead for the DevOps track along with Julien Stroheker from Microsoft. Leading up to the conference Julien and I reached out to the speakers to get slide decks and answer any questions, then the two of us met up to chat about the track in San Francisco the week before. The day of the track everything went smoothly, with Mark Galpin of JFrog getting us started by talking about the development and deployment of three key projects: Artifactory, Bintray and Xray. They learned quickly that the application and configuration layers had to be configured separately and that it was best to plan for enterprise and high-availability early on, rather than trying to build it in later. He also discussed importance of good startup scripts and package management in a microservices world, since you don’t simply use an RPM anymore to install things… but you do still need RPMs because not every customer will want to adopt a microservices environment just yet! Standardization was also important, customers wanted the ability to hook into their existing storage and network backends, and often had unusual requirements (for example, “my nodes have no outbound internet access”). He concluded by stressing that everyone in the DevOps organization should be involved from the beginning when working to develop the products and how they’re being deployed so that they consider every aspect of how the tooling will be used and deployed.
The next talk was by Aaron Baer from Athena Health who brought us deep into the world of healthcare information services where monoliths rule and shared his experiences working to bring new technologies build on microservices into the fold. Like I heard earlier in they keynote panel, he’s been working on a path to incrementally move from monolithic and relatively inflexible services over to microservices. In addition to building new infrastructure where they can begin using containers, they’ve also been using their legacy database as a canonical resource and storage, and doing data processing and delivery through faster, more modern data storage and processing tools. He’s also a proponent of code as infrastructure (me too!) so it was great to hear him talk about how essential API-driven infrastructure engineering and automation is to getting our infrastructures to the next level.
Aaron Baer on the progression of monolith to microservice
After lunch the track continued with Chris Mays and Micah Noland from HERE Technologies sharing details about the Deployment API they built for DC/OS. They wanted a simplified version of the core DC/OS API which was deployment-centric and needed an open source option that would improve upon Marathon-LB. The API they’ve developed has just recently been open sourced and can be found here: https://github.com/heremaps/deployment-api. During the talk Micah provided an in depth tour of how straight-forward their YAML configurations were for a sample deployment of a Jenkins pipeline. The API theme continued in the next talk by Marco Palladino of Mashape, the makers of Kong. Kong is an open source gateway for APIs which allows you more control over access, including security, authentication and rate limiting. He stressed how important this whole ecosystem of support is for microservices, and his talk also touched upon the steady migration needed by most organizations from their old stack to the new, which also meant keeping a lot of pieces like their legacy authentication system intact but that adding too many layers can leave to unforgivable latency.
Lively booth area break before the final two sessions of the day
The day concluded with talks from Will Gorman of Cerner on Spinnaker and Imran Shaikh of YP on hybrid clouds. I enjoyed the entire DevOps track, but these two talks really went to the heart of what I’ve focused on infrastructure-wise in the past five years. Give my CI/CD background, I’ve known about Spinnaker for some time, but never made diving into it a priority. This talk inspired me. As you may have guessed, Spinnaker is a continuous delivery platform and Will began his talk by covering the basics about it before getting into integration with DC/OS, including the ability to hook into Metronome for jobs. He also touched upon controls Spinnaker has for the deployment phase to complement Marathon’s own attempts to deploy and restart processes. I’ll definitely be digging into it on my own very soon. Imran’s talk on the importance of being responsible for your own infrastructure was very satisfying for me. He led his talk by talking about how moving to the cloud doesn’t absolve you of the responsibility of making sure you have a good disaster recovery plan, reminding the audience that your business relies upon a good failover plan, and your cloud provider has different goals. Throughout his talk his talking points revolved around vendor independence, distributing risk and the importance of your processes being portable to other solutions.
Will Gorman showing how to execute a Metronome job as a step in a Spinnaker pipeline
Thursday evening had another MesosCon first, Town Hall sessions. A big hit with attendees, these were casual evening gatherings on various topics (Mesos, DC/OS and Marathon/Chronos). They allowed community members to gather and talk about whatever they wanted, swap war stories, share tips, anything! Sadly I had to miss this, participation in the Open Source Summit earlier in the week and with a panel to prep for, I was running low on energy. I’m looking forward to attending the ones we host at MesosCon EU coming up at the end of October.
Friday morning came quickly, and I’ll let you in on a little secret: I had breakfast with my Future of Cluster Management panelists. The panel was made up of Sharma Podila (Netflix), Sam Eaton (Yelp), Ian Downes (Twitter) and Zhitao Li (Uber). As the panel moderator I had worked with them on questions and spoke with them via video conference, I hadn’t met any of them in person. A casual, leisurely hotel breakfast was a great way to break the ice and answer any lingering questions folks had, including myself (“none of you are going to surprise me with the answer you give about the importance of open source, right?”). I highly recommend such a gathering for other moderators if it’s possible, I’ve been a panelist a few times and on stage ice-breaking is not always the best way to go.
Jörg was the MC for day two of keynotes, and after a community-focused introduction to this second day (the first day was more enterprise-focused), he brought us on stage. Talking about the future is always fun, and this panel was no exception. After introductions, we covered the importance of open source in our ecosystem, something that is near and dear to my heart. Theme-wise we talked about the move from perfecting deployments on containerized systems to a focus on maintenance, and the tools they’ve had to built outside of what Mesos provides to manage their clusters. The question of how they handle mixed workloads (stateless vs. stateful, batch vs. long-lived) and what improvements could be made here. The conclusion centered around the future of artificial intelligence of clusters, which led us to learn that some of them are already using machine learning to teach their clusters about load and behavior, which was later explored in a track talk from folks at Twitter who are doing automated performance tuning using Bayesian optimization. The last question I had for the panelists was what they wanted to see in 5-10 years, and their answer was identical: they don’t want to care about the underlying tech, they just want their workloads to run.
Keynote panel: Future of Cluster Management, thanks to Julien Stroheker for the photo (source)
The morning keynotes continued with a talk from folks at the National Geospatial-Intelligence Agency that spoke to how they’re using Mesos at scale inside of their organization. Keynotes concluded with one from Ross Gardler from Microsoft who explored the question of where we were in the hype/actual production cycle for microservices and containers. He expressed that there’s still a lot of diversity in the field, and that with so many early deployments there’s no clear “winner” in the space of containerization. Today, he shares, the focus should be sharing best practices and methodologies across platforms so that all of the next iterations of our various containerization platforms can be that much better.
Just like Thursday, after keynotes we shuffled off into various tracks, with Friday featuring tracks on Mesos Internals, Ops, SMACK and the one I took over as track lead for, Mesos Frameworks. My focus in my role at Mesosphere had largely been on higher level operations, so it was a very interesting track for me to sit in on for the day. There’s a lot of really interesting work being done in the Mesos ecosytem that I’d love for my team to be a part of drawing more attention to. This day was great for that.
The first talk was on the juggling that happens around optimizations, service guarantees and how those trade-offs manifest by Sharma Podila of Netflix, who I’d just met earlier in the keynote panel. He began by asking some questions about your cluster and explaining the heterogeneous nature of the hardware they have at Netflix, you never know which generation hardware your workload is going to land on. He then dove into the challenge of scaling down, showing off the open source Fenzo, “a scheduler Java library for Apache Mesos frameworks that supports plugins for scheduling optimizations and facilitates cluster autoscaling.” His talk was followed by one from Wil Yegelwel for TwoSigma who joined us to talk about simulation testing. In spite of my deep interest in CI/CD, simulation testing is something that’s largely escaped my radar. Thankfully he gave an introduction to the concept and then shared how they used the scheduler they built at Two Sigma, Cook, to intelligently handle scheduling of workloads. The intersection of an intelligent scheduler and a policy of simulation testing meant they could run tests that changed configurations and see how they could optimize things as much as possible, and he explained that this has sometimes resulting in some surprising, non-intuitive discoveries that boosted performance. He also credited simulation testing with helping engineers become more familiar and comfortable with the infrastructures they’re running, which are growing increasingly complicated and difficult to fully understand.
After lunch we dove right back in with Joshua Cohen and Ramki Ramakrishna from Twitter who came to speak on the aforementioned Bayesian optimization talk. It’s well known that Twitter’s infrastructure is built on a lot of microservices. At that scale, they explained, manual performance tuning of hundreds of JVM options doesn’t scale, is error-prone, time-consuming and honestly leads to goal-driven engineers copying existing configs to get going, without fully understanding why the optimizations exist in specific places. They went on to stress that even if you do manage to get everything running well, it’s effectively undone as soon as you change or upgrade any component. Instead they shipped the problem off to an internal machine-learning driven technique that uses Bayesian inference to optimize performance. Bayesian became popular in my own operations world for spam analysis, but the methodologies as I understand them make a lot of sense here as well. Unfortunately they haven’t open sourced the Bayesian Optimization Auto-Tuning (BOAT) tooling, but knowing that it’s been done and is successful for them is a good start for others looking to do similar machine-learning inspired performance tuning.
Next up was Tomek Janiszewski from Allegro who gave us 8 tips for Marathon Performance, summing them up:
- Monitor — enable metrics
- Tune JVM
- Optimize Zookeeper
- Update 1.3.13
- Do not use the event bus
- Use a custom executor
- Prefer batching
- Shard your Marathon
Marathon is a popular project in the Mesos ecosystem, so the room was quite full for this talk and the questions about it went well into the coffee break we had between sessions!
After the coffee break we were in the home stretch, only two more talks to go before MesosCon concluded! First up in this final pair was Dragos Dascalita Haut and Tyson Norris from Adobe Systems who where sharing details about their use of OpenWhisk as a Mesos framework. They quickly covered OpenWhisk, an event-driven serverless platform for deploying code. Their lively presentation was full of demos and they shared their Akka-based mesos-actor project before pulling in the OpenWhisk bits that brought together all the pieces of the work they were doing.
The final talk brought a pair of engineers from Verizon Labs, Tim Hansen and Aaron Wood, to share their work on fault-tolerant frameworks without Docker. They share some of my own opinions about the over-use of Docker where something slimmer would suffice, so I knew going in that I’d enjoy this talk. The talk explained why using Docker and the entire Docker daemon was often overkill for their work, specifically Tim saying that there was “no real need for an extra daemon and client when Mesos can containerize tasks.” So they instead directed the audience toward use of the Mesos containerizer and the CNCF’s Container Network Interface (CNI). To this end they developed Hydrogen, “High performance Mesos framework based on the v1 streaming API” and their own Mesos SDK, a general purpose Golang library for writing mesos frameworks.
As always, as tiring as the week was it was still sad to see it conclude. It was great to meet up with so many people, huge thanks to everyone who take time to let me pick their brain, or simply sat down for a quick chat. I learned a lot while there and had a wonderful time.
Clockwise from top left: With Yang Lei (IBM), Dave Lester (Apple), Mesosphere community team (minus Jörg, plus Ravi!), and Aaron Baer
More photos from MesosCon NA 2017 here: https://www.flickr.com/photos/pleia2/albums/72157689480391705