Amid digital transformation and ever-worsening cyber attacks, shoring up the ops in DevSecOps has risen to the top of the SRE priority list.
As DevSecOps developed, IT pros focused first on “shifting security left” to the software development stage of the DevOps process and making applications more secure by design. But recently, site reliability engineers (SREs) have also devoted more of their attention to “shifting right” — automating security monitoring in production and contributing feedback based on security monitoring data back to developers.
The idea that security is everyone’s responsibility isn’t necessarily new, but it’s taking on increased urgency for SREs who spoke at this week’s SRECon.
“The number of CVEs [common vulnerabilities and exposures] opened every year has increased by over 400% since 2010, [and there was a] gigantic jump between 2016 and 2017,” said Adam Debus, staff SRE at professional social network LinkedIn, in a conference presentation, citing statistics from the National Institute of Standards and Technology. “The entire world was starting their cloud migrations … this introduced a ton of security vulnerabilities as people reworked their paradigms.”
In 2019, as LinkedIn entered the later stages of growth from 65 million to 800 million members and began migrating its infrastructure to Microsoft Azure, Debus said it became clear that security and compliance were areas where SREs could play a crucial role.
“We focus a lot on the Rs in SRE — reliability, resiliency — but I proposed that we can’t have a reliable or resilient environment unless the infrastructure that we’re building for it is secure,” Debus said.
LinkedIn SREs streamline OS testing, boost security
Debus spotted an opportunity to reduce toil — tedious manual work that can lead to burnout — during another migration LinkedIn made in 2019, away from standard distributions of the Linux kernel toward custom versions optimized for performance. The new kernels had to be tested to ensure compliance with LinkedIn’s security policies throughout its cloud infrastructure, which would be a massive undertaking. The number of LinkedIn members grew 1,200% from 2010 to 2022, but the number of infrastructure resources under management for the company grew a disproportionate 6,000% in that same period. At the same time, the company used various manual processes for this kind of security certification testing, which would be untenable for such a large and complex project.
“You have dozens, if not a hundred people, especially in an environment the size of LinkedIn, who are consistently having to drop what they’re doing in their development work to certify a new OS, and that’s not sustainable,” Debus said.
Debus’ SRE team created a standardized, automated process for the company to reduce this toil and ensure that both security certification and performance testing were comprehensive before new OS images were deployed to production. The new process began with testing OS snapshots on the dozens of hardware configurations the company uses in its on-premises and cloud deployments, followed by third-party drivers. The process then hardened the OS according to the company’s standard security policies and scanned for known vulnerabilities, before the OS snapshot was put through LinkedIn’s DevOps pipelines and configuration management processes, tested in various programming language environments and, finally, tested alongside the company’s applications.
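As a rough illustration of the staged process described above, the following Python sketch runs a snapshot through an ordered list of checks and stops at the first failure. It is not LinkedIn's framework: the stage names paraphrase the article, and `CertificationPipeline`, `always_pass` and `build_example_pipeline` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A check takes a snapshot identifier and reports pass/fail.
# Hypothetical signature; real checks would do far more work.
StageCheck = Callable[[str], bool]


@dataclass
class CertificationPipeline:
    """Ordered list of (stage name, check) pairs, run one after another."""
    stages: List[Tuple[str, StageCheck]] = field(default_factory=list)

    def add_stage(self, name: str, check: StageCheck) -> None:
        self.stages.append((name, check))

    def run(self, snapshot: str) -> Tuple[bool, List[str]]:
        """Run stages in order and stop at the first failing stage.

        Returns (passed, names of the stages that completed successfully).
        """
        completed: List[str] = []
        for name, check in self.stages:
            if not check(snapshot):
                return False, completed
            completed.append(name)
        return True, completed


def always_pass(snapshot: str) -> bool:
    """Placeholder check used only to make the sketch runnable."""
    return True


def build_example_pipeline() -> CertificationPipeline:
    """Assemble the stages in the order the article lists them."""
    pipeline = CertificationPipeline()
    for name in [
        "hardware configurations",
        "third-party drivers",
        "security hardening",
        "vulnerability scan",
        "DevOps pipelines and configuration management",
        "programming language environments",
        "application tests",
    ]:
        pipeline.add_stage(name, always_pass)
    return pipeline
```

Calling `build_example_pipeline().run("os-snapshot-example")` returns a pass flag plus the list of completed stages. Because every snapshot goes through the same ordered list, a failed certification can be traced to a specific stage.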
Now, SRE team members maintain the automation framework for this testing, and developers plug their own tests into it, Debus said.
“It’s a symbiotic relationship,” Debus said of his team’s work with developer teams to create the testing system. “We went to them and said, ‘Look, we’ve developed a really flexible automation platform and framework that you can use to write your tests — you can do anything you want; you just need to call these three functions.’”
In the nine months that the testing framework has been used in production, it has reduced the total time to certify a new OS snapshot by 70%, Debus said.
“But that’s only the tip of the iceberg,” he said. “It’s the same testing process over and over again, which means it’s defensible, which makes information security happier. And we’ve also added new testing paradigms, taking a lot of the analysis out of a human need to look at [a test result] and decide if it’s good or not.”
eBPF appears on SREs’ radar as a tool for DevSecOps
Other presenters at SRECon echoed Debus’ reflections about the increasing focus on security in their work.
“Right now, if you run a global service, you’re probably seeing a lot of traffic from network segments that might be unfriendly, and so you have to figure out how to deal with that,” said Shaun Mouton, principal software engineer at Mastercard, in a conference presentation.
Some, including Mouton, also discussed Extended Berkeley Packet Filter (eBPF). The tool has gained popularity in cloud-native circles for microservices networking and observability over the last two years, and could be valuable for SREs in DevSecOps environments as well.
Companies often need to know whether software supply chain attacks have changed the behavior of applications on the network. That can be difficult to detect, especially in the case of legacy apps whose developers no longer work for the company, or third-party apps that don’t allow access to their source code. Mouton’s presentation demonstrated the ways eBPF could be used to build profiles of typical behavior by such applications, including monitoring which files they typically access, to detect suspicious changes in behavior.
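The profiling idea can be sketched in a few lines of Python. This is not Mouton's demonstration and it contains no eBPF code: it assumes file-access events have already been collected by some tracer and are supplied as plain `(process name, file path)` pairs. The names `build_profile` and `find_deviations` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Assumed event shape: (process name, file path). A real tracer would
# emit richer records; this is a simplification for illustration.
FileEvent = Tuple[str, str]


def build_profile(events: Iterable[FileEvent]) -> Dict[str, Set[str]]:
    """Record the set of files each process touched during a baseline period."""
    profile: Dict[str, Set[str]] = defaultdict(set)
    for process, path in events:
        profile[process].add(path)
    return dict(profile)


def find_deviations(
    profile: Dict[str, Set[str]], events: Iterable[FileEvent]
) -> List[FileEvent]:
    """Return accesses not seen in the baseline, each reported once, in order."""
    reported: Set[FileEvent] = set()
    deviations: List[FileEvent] = []
    for process, path in events:
        known_paths = profile.get(process, set())
        if path not in known_paths and (process, path) not in reported:
            reported.add((process, path))
            deviations.append((process, path))
    return deviations
```

In this sketch, a process that never appeared in the baseline has an empty profile, so every file it touches is flagged. A production system would also need a way to retire stale baseline entries and tolerate legitimate changes such as log rotation.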
“In my experimentation, I found that eBPF is fantastic for point-in-time snapshots of the system,” Mouton said. “It’s well developed toward streaming events based on queries like, ‘Show me how many new processes are being forked on this system,’ or ‘Tell me about interesting aspects of my HTTP traffic.’”
Where eBPF is less effective so far is tracing the behavior of applications over time, especially as code is changed, or forked, Mouton said.
“You can attach to a running process with eBPF pretty easily,” he said. “But following the process from invocation to close … is a little bit more challenging with eBPF in its current state.”
Overall, eBPF is still a relatively new tool for uses beyond basic network routing functions and outside Linux VMs. For example, New Relic’s eBPF-based auto-instrumentation tool for Kubernetes observability, Pixie, has only been available for about a year. IT practitioners and vendors are still exploring how to use updates to the upstream Linux kernel that tie eBPF in with Linux Security Module (LSM) framework hooks.
“Over the next couple of years, I see a lot of people latching on to the LSM functionality to build products to ensure that systems are safe and the programs that we run on our systems aren’t being compromised,” said Michael Kehoe, senior staff security engineer at Apache Kafka commercial vendor Confluent, in an SRECon presentation. “There is deep integration [for eBPF] with the Linux security modules to go and get real-time, rich security data.”
Beth Pariseau, senior news writer at TechTarget, is an award-winning veteran of IT journalism. She can be reached on Twitter @PariseauTT.