This is the way: Holistic approach to network automation

In the past months and years I’ve had a number of great discussions with a lot of fantastic networking people on how to (not) do network automation (at scale), what worked for them and what didn’t. At a smaller scale I gained quite some experience myself in previous roles as well as consulting engagements, and in particular by building and operating the Freifunk Hochstift community ISP network plus its SDN. This article is the distilled result of those discussions and experiences and might be somewhat opinionated.

Network automation done right

I’ve seen a number of approaches where the tool of choice (Cisco Prime Infrastructure, Ansible, Salt, self-built Expect/Perl/Python scripts, …) only did the initial config and/or updated parts of the configuration as needed, for example to add new users or to add and apply ACLs. Interfaces, VLANs and IPs were configured manually, which eventually resulted in device configurations diverging over time: interfaces not having OSPF/IS-IS configured at all or for one address family only, different defaults for OSPF/IS-IS passive interfaces, missing ACLs, CDP/LLDP enabled on some devices and disabled on others, etc. You get the idea.

Did you notice I only spoke of changing and adding stuff in the examples above? Another thing I often see is old config sticking around: you stumble across unused ACLs, VLANs, and even user accounts of people who left years ago. The latter mostly on devices which haven’t been touched in a while for whatever reason.

This leads me to the conclusion that the only way to achieve reliable, uniform and surprise-free network automation is to go all in and let your automation own the full configuration of all devices. There, I said it.

But Max, what if I have to configure something manually?

Well, you shouldn’t have to.

The idea behind this is that there is an automation stack in place which generates the whole configuration for your device(s) and pushes the latest generated version to the box(es). This might happen because a timer fired, because you pushed some buttons / ran some command to achieve X, or because some other automation decided it’s time to do Y, with X and Y being anything from configuring a VLAN, IP address(es), ACLs or BGP neighbors to traffic engineering, you name it. This means you shouldn’t need to configure anything manually on the devices.
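To make the "automation owns the full configuration" idea a bit more tangible, here is a minimal sketch in Python. It assumes a made-up, vendor-neutral config syntax and hypothetical names (Device, Interface, render_config, config_diff); none of this comes from a real tool. The point it illustrates is that the config is rendered from the intended state only, so anything configured by hand on the box simply disappears on the next full replace.

```python
import difflib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Interface:
    name: str
    description: str
    address: Optional[str] = None  # e.g. "192.0.2.1/31"


@dataclass
class Device:
    hostname: str
    vlans: List[int] = field(default_factory=list)
    interfaces: List[Interface] = field(default_factory=list)


def render_config(device: Device) -> str:
    """Render the complete config from the intended state only (made-up syntax)."""
    lines = [f"hostname {device.hostname}"]
    for vlan in sorted(device.vlans):
        lines.append(f"vlan {vlan}")
    for iface in sorted(device.interfaces, key=lambda i: i.name):
        lines.append(f"interface {iface.name}")
        lines.append(f" description {iface.description}")
        if iface.address:
            lines.append(f" ip address {iface.address}")
    return "\n".join(lines) + "\n"


def config_diff(running: str, generated: str) -> List[str]:
    """Return the added/removed lines a full replace would cause, for review or logging."""
    diff = list(difflib.unified_diff(
        running.splitlines(), generated.splitlines(),
        fromfile="running", tofile="generated", lineterm="", n=0,
    ))
    # Skip the two file header lines and the "@@" hunk markers.
    return [line for line in diff[2:] if not line.startswith("@@")]


# Example: the box still carries a stale VLAN nobody remembers.
running = (
    "hostname edge1\nvlan 10\nvlan 666\n"
    "interface eth0\n description uplink\n ip address 192.0.2.1/31\n"
)
intended = Device("edge1", [10], [Interface("eth0", "uplink", "192.0.2.1/31")])
print(config_diff(running, render_config(intended)))  # ['-vlan 666']
```

In a real stack the rendering would more likely be done by a template engine and the push by a config replace mechanism of the platform; the sketch only shows the principle that the generated config, not the running one, is the reference.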

This does not limit/affect the ability of your Network Engineers to SSH into the boxes and do debugging stuff via the CLI.

But Max, what if the automation has a problem?

Well yes, this will eventually happen, no argument here. Therefore your automation needs a big red button: a kill switch for your entire automation or for a particular part of it (per device or whatever).
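As a rough illustration of such a big red button, the following Python sketch checks a global and a per-device stop flag before anything gets pushed. The names KillSwitch and deploy, and the push callback, are hypothetical; in practice the flags would live somewhere your operators can flip them quickly (a file, a database row, a feature flag service).

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class KillSwitch:
    global_stop: bool = False                               # stops all automation
    stopped_devices: Set[str] = field(default_factory=set)  # stops single devices

    def allows(self, hostname: str) -> bool:
        return not self.global_stop and hostname not in self.stopped_devices


def deploy(hostname: str, config: str, switch: KillSwitch,
           push: Callable[[str, str], None]) -> bool:
    """Push the config unless the kill switch says no; return whether it was pushed."""
    if not switch.allows(hostname):
        return False
    push(hostname, config)
    return True
```

Checking the switch right before every push, rather than once at the start of a run, means pressing the button also stops a rollout that is already in progress.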

But Max, this can’t be done, because it’s way too complicated and we can’t afford this!

Yes, it’s a non-trivial endeavor, but it can be done and people have done it. My return question would be: can you afford hiring Network Engineers to keep up and maintain stuff (partially) manually and clean up when something falls over?

But Max, where does the automation get all the interfaces, IPs, BGP neighbors, … from?!

Excellent question, I’m glad you asked! This approach obviously requires you to have some kind of DCIM + IPAM which holds which devices are where, what they are doing, what they are connected to, which IPs they have, etc. This information has to be somewhere anyway, right? So get it structured so you can leverage it from all tools which could rely on that information. Maybe NetBox or Nautobot can be of help for you, maybe not. Maybe you have other DCIM and/or IPAM solutions in place; as long as they have at least a read-only API you can build on that. Maybe you have other systems in place which hold your customer DB, including which customer is connected to which router ports, has which prefixes, subscribed to x amount of bandwidth, etc. Use what you have.
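To give an idea of what "get it structured" can mean, here is a small Python sketch which turns a flat export of interface records into a per-device view a config generator could consume. The JSON shape and the load_intent function are assumptions made up for this example; a real DCIM/IPAM API (NetBox, Nautobot or anything else) will have its own schema.

```python
import ipaddress
import json
from collections import defaultdict
from typing import Dict, List

# Hypothetical export; a real DCIM/IPAM API will use a different schema.
INVENTORY_JSON = """
[
  {"device": "edge1", "interface": "eth0", "address": "192.0.2.0/31", "peer": "core1"},
  {"device": "edge1", "interface": "lo0", "address": "198.51.100.1/32", "peer": null},
  {"device": "core1", "interface": "eth3", "address": "192.0.2.1/31", "peer": "edge1"}
]
"""


def load_intent(raw: str) -> Dict[str, List[dict]]:
    """Group interface records by device and validate their addresses."""
    intent = defaultdict(list)
    for record in json.loads(raw):
        # ip_interface() raises ValueError on malformed input, so bad data
        # fails here and not halfway through a deployment.
        address = ipaddress.ip_interface(record["address"])
        intent[record["device"]].append({
            "name": record["interface"],
            "address": str(address),
            "peer": record["peer"],
        })
    return dict(intent)


intent = load_intent(INVENTORY_JSON)
print(sorted(intent))  # ['core1', 'edge1']
```

Validating the data while loading it is a deliberate choice in this sketch: the source of truth is only useful to the automation if garbage in it gets caught before it ends up in a device config.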