Our next generation firewall project
December 7, 2023
Early last year, ngrok started a project to revamp its firewall system to better protect its infrastructure and, by extension, its customers. In this blog post, I'd like to share a few of the high-level goals we set when designing this new system, called Firewall Next-Generation (or FWNG), explain the technical bits that make it work, and introduce the Firewall Toolkit, which we built along the way and open-sourced for the community.
Our previous firewall system had a data center/regional view of traffic - it operated independently in each region. It collected network telemetry via application metrics within an individual region, generated iptables rules, and applied them on each host. If a network attack targeted hosts across multiple ngrok regions, each region would need to gather, analyze, and mitigate the attack independently.
While not having to coordinate between regions/data centers did have the significant advantage of reduced complexity, ngrok's ongoing global network initiatives made it critical to have a global view of traffic. As one of our engineers put it, "Now that customer traffic is globally distributed, the DDOS's are too." As a simple example, with ngrok's global network, if a geographically distributed botnet were to send traffic to one of ngrok's "global" domains, each node in the botnet would send traffic to its geographically closest ngrok region, so no single region would see the full scale of the attack.
Instead of aggregating and analyzing telemetry locally within an ngrok region, FWNG aggregates telemetry in our controlplane. The various telemetry sources in a region send data to a local-to-the-region sink, which is then replicated to the controlplane.
Once in the controlplane, data is aggregated, enriched, analyzed, and used to generate traffic policies. Traffic policies are communicated back to all ngrok nodes in all regions. In this way, all ngrok infrastructure has a consistent view of the currently enforced traffic policies, regardless of region. The major advantage of this system is that an attack that originates in one region results in traffic policies that affect all regions. If malicious traffic were directed to other ngrok regions, they would be ready to enforce the same policies that were generated from a different region.
Another significant benefit is that it allows us to track traffic sources across regions. As an example, network-level rate limits designed to protect ngrok's infrastructure from excessive traffic are aggregated across all regions - meaning that a single traffic source is allowed a specific rate of traffic into ngrok's infrastructure, whether that's to a single region, or distributed across many regions.
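To make the cross-region rate limit concrete, here is a minimal sketch in Go. It is an illustration only, not FWNG's actual implementation: the `RegionSample` type, the `overGlobalLimit` function, and the packets-per-second unit are all assumptions made for this example. It shows why aggregation matters: a source that stays under the limit in every individual region can still exceed it globally.

```go
package main

import (
	"fmt"
	"sort"
)

// RegionSample is a hypothetical per-region observation: the packets per
// second seen from one source IP in one region.
type RegionSample struct {
	Region   string
	SourceIP string
	PPS      float64
}

// overGlobalLimit sums each source's rate across all regions and returns the
// sources whose combined rate exceeds limitPPS, sorted for stable output.
func overGlobalLimit(samples []RegionSample, limitPPS float64) []string {
	total := make(map[string]float64)
	for _, s := range samples {
		total[s.SourceIP] += s.PPS
	}
	var offenders []string
	for ip, pps := range total {
		if pps > limitPPS {
			offenders = append(offenders, ip)
		}
	}
	sort.Strings(offenders)
	return offenders
}

func main() {
	samples := []RegionSample{
		{"us", "203.0.113.7", 600}, // under 1000 in this region alone
		{"eu", "203.0.113.7", 500}, // under 1000 here too
		{"ap", "198.51.100.9", 300},
	}
	// 203.0.113.7 totals 1100 across regions, so it exceeds the limit.
	fmt.Println(overGlobalLimit(samples, 1000)) // prints [203.0.113.7]
}
```

A purely regional system evaluating the same samples would flag nothing, since no single region sees more than 600 packets per second from any source.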
Let's get more specific regarding telemetry sources, traffic policies, and how those are enforced.
Currently, FWNG ingests network telemetry from two major sources - network flows from the NetObserv eBPF Agent and our own application's metrics.
The NetObserv eBPF Agent is a part of the NetObserv Operator, maintained by the awesome folks at Red Hat. It uses eBPF programs attached to network interfaces to collect network flows and does it very efficiently.
The efficiency with which the NetObserv eBPF Agent can collect and export network flows allows low-level network monitoring without any special or dedicated hardware or access to physical routers in the network. We run the Agent on each ingress node in our system with minimal performance overhead alongside our other applications.
The agent exports network flows in protobuf-formatted messages, which we collect and process in the controlplane.
In addition to network flows, we collect application-emitted metrics. Application metrics can be very useful in enriching existing network telemetry and helping to detect traffic patterns that would be difficult (or impossible) to parse from low-level network flow data. For example, they show how many connections or how much data is destined for an ngrok tunnel owned by a particular account, or how much traffic from a specific source IP was destined for an online ngrok tunnel.
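As a rough sketch of what combining the two telemetry sources might look like, the Go example below joins flow-level byte counts with application-level connection counts by source IP. Every type and field here (`Flow`, `AppMetric`, `SourceSummary`, `enrich`) is hypothetical and heavily simplified; the real flow records are protobuf messages from the NetObserv eBPF Agent and carry far more detail.

```go
package main

import "fmt"

// Flow is a hypothetical, simplified network flow record.
type Flow struct {
	SourceIP string
	Bytes    uint64
}

// AppMetric is a hypothetical application-emitted metric: how many
// connections a source opened to tunnels owned by a given account.
type AppMetric struct {
	SourceIP    string
	AccountID   string
	Connections int
}

// SourceSummary combines both views of a single traffic source.
type SourceSummary struct {
	Bytes          uint64
	ConnsByAccount map[string]int
}

// enrich merges flow data and application metrics into one summary per source IP.
func enrich(flows []Flow, metrics []AppMetric) map[string]*SourceSummary {
	out := make(map[string]*SourceSummary)
	get := func(ip string) *SourceSummary {
		s, ok := out[ip]
		if !ok {
			s = &SourceSummary{ConnsByAccount: make(map[string]int)}
			out[ip] = s
		}
		return s
	}
	for _, f := range flows {
		get(f.SourceIP).Bytes += f.Bytes
	}
	for _, m := range metrics {
		get(m.SourceIP).ConnsByAccount[m.AccountID] += m.Connections
	}
	return out
}

func main() {
	flows := []Flow{{"203.0.113.7", 1000}, {"203.0.113.7", 500}}
	metrics := []AppMetric{{"203.0.113.7", "acct_a", 3}}
	for ip, s := range enrich(flows, metrics) {
		fmt.Printf("%s bytes=%d conns=%v\n", ip, s.Bytes, s.ConnsByAccount)
	}
}
```

The flow data alone says how much a source sent; the application metrics add which account's tunnels it was talking to, which the network layer cannot see.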
Today, all policies are enforced via nftables rules on our ingress nodes. Each ingress node in the dataplane runs a small Go service that polls the controlplane for the current set of traffic policies, then updates nftables rules, tables, and sets, to enforce that policy. To make that process more ergonomic and programmable, we developed the Firewall Toolkit.
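The poll-and-update loop described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual service: `Policy`, `diff`, and `poll` are hypothetical names, a policy is reduced to a list of blocked IPs, and the `fetch` and `apply` callbacks stand in for the controlplane request and the nftables update respectively.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"time"
)

// Policy is a hypothetical, simplified traffic policy: source IPs to block.
type Policy struct {
	BlockedIPs []string
}

// diff returns the elements to add and remove to turn current into desired.
func diff(current, desired []string) (add, remove []string) {
	cur := make(map[string]bool, len(current))
	for _, ip := range current {
		cur[ip] = true
	}
	des := make(map[string]bool, len(desired))
	for _, ip := range desired {
		if !cur[ip] && !des[ip] {
			add = append(add, ip)
		}
		des[ip] = true
	}
	for _, ip := range current {
		if !des[ip] {
			remove = append(remove, ip)
		}
	}
	sort.Strings(add)
	sort.Strings(remove)
	return add, remove
}

// poll fetches the policy on every tick and applies only what changed.
// On any error it keeps the last successfully applied policy in place.
func poll(ctx context.Context, every time.Duration,
	fetch func(context.Context) (Policy, error),
	apply func(add, remove []string) error) {
	var installed []string
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			p, err := fetch(ctx)
			if err != nil {
				continue
			}
			add, remove := diff(installed, p.BlockedIPs)
			if len(add)+len(remove) == 0 {
				continue // nothing changed since the last poll
			}
			if err := apply(add, remove); err != nil {
				continue
			}
			installed = p.BlockedIPs
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	// Stand-ins for the controlplane call and the nftables set update.
	fetch := func(context.Context) (Policy, error) {
		return Policy{BlockedIPs: []string{"203.0.113.7"}}, nil
	}
	apply := func(add, remove []string) error {
		fmt.Println("add:", add, "remove:", remove)
		return nil
	}
	poll(ctx, 10*time.Millisecond, fetch, apply)
}
```

Applying only the difference between the installed and desired state keeps each update small, and leaving the previous policy in place on error means a failed poll does not drop enforcement.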
Firewall Toolkit builds on top of Google's nftables Go library and allows you to ergonomically and programmatically generate nftables rules, modify tables and sets, and set up data sources that will keep those rules, tables, and sets up to date with the data source - all via communication with the netlink socket in pure Go.
Using the library, creating an nftables rule that blocks incoming connections from IPs in a set destined to ports in a set can be as simple as:
    ipv4Exprs, err := rule.Build(
        expr.VerdictDrop,
        rule.AddressFamily(expressions.IPv4),
        rule.TransportProtocol(expressions.TCP),
        rule.SourceAddressSet(ipv4Set.Set()),
        rule.DestinationPortSet(portSet.Set()),
        rule.Any(expressions.Counter()),
    )
We have released the Firewall Toolkit as an open-source library for the community to use, and we encourage contributions!
Going forward, we hope to build and utilize eBPF filters to enforce traffic policies as well, for a couple of main reasons:
That gives insight into the design and goals of ngrok's new firewall system. We still have plenty of work to do!
As we build and improve that piece, we look forward to sharing more details about the controlplane aggregation and analysis components. Until then, we recommend you check out the following resources.
The latter two have been instrumental in helping us implement FWNG*!
Questions or comments? Hit us up on X (aka Twitter) @ngrokhq or LinkedIn, or join our community on Slack.
*We usually pronounce it "fwing", but a few space opera-inclined ngrokkers are really pushing for "f-wing".