Tips for Network Engineers to make life easier
A non-technical but (hopefully) helpful list of tips learned the hard way, by myself or by others before me.
-
Color code your Terminal/CLI sessions. All terminals have this feature; I use SecureCRT and change the background of saved sessions according to importance: backbone - black, production - gray, lab - light beige. I once ran the "reload" command on a production Cisco, thinking I was in the lab Cisco session :). With a different color for each, it is much harder to make that mistake.
-
Enable logging of all Terminal sessions. I mean all input/output of the session. This will help on so many occasions: to prove it was not you who made the change, to trace back a change made a few months ago to solve an issue, to retrieve a lost setting/parameter from the config. The real ninjas, after troubleshooting an issue, go back to those logs to learn what went wrong/right, and write up notes for others or for their future selves.
-
Debug network problems using the OSI model, starting with Layer 1. And remember to add Layer 8 for the User. The upper OSI layers are very smart and may mask Layer 1 or 2 problems by trying to compensate.
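A minimal bottom-up check sequence along those layers, assuming a Cisco-style CLI and a Linux host for the Layer 4 test (interface names and addresses here are made-up examples):

  ! Layer 1 - link state and errors on the port
  show interfaces GigabitEthernet0/1
  ! Layer 2 - is the MAC address learned where you expect it?
  show mac address-table interface GigabitEthernet0/1
  ! Layer 3 - reachability and path
  ping 192.0.2.10
  traceroute 192.0.2.10
  # Layer 4 - is the service port actually open? (run from a Linux host)
  nc -vz 192.0.2.10 443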
-
Even in our times of super-duper-AI and 400Gb connections, cables will fail and cause CRC errors, SFP+ modules will misbehave. We once had an E1 line flapping because of faulty grounding of the Cisco router, and it took a long time to figure that out. So do not dismiss Layer 1 problems as a thing of the past.
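A quick way to spot such trouble on a Cisco-style box (the interface name is an example; transceiver diagnostics need optics with DOM support, and syntax varies by platform):

  ! Growing input error/CRC counters point at cable, optics, or duplex problems
  show interfaces GigabitEthernet0/1 | include errors|CRC
  ! Optical power levels and temperature on supported SFP/SFP+ modules
  show interfaces GigabitEthernet0/1 transceiver
  ! Clear counters first if you want to see whether errors are still increasing
  clear counters GigabitEthernet0/1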
-
MTU mismatch problems are the ugliest. They can cause slowness in specific apps only, jitter, throughput bouncing up and down, BGP sessions being established but exchanging only partial or no routing information. Use ping with the Don't Fragment bit set end-to-end. I wrote how to do it in different OSes (Solaris, anyone?) back in 2009 - https://yurisk.info/2009/09/01/ping-setting-dont-fragment-bit-in-linuxfreebsdsolarisciscojuniper/
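The basic idea, assuming a standard 1500-byte Ethernet MTU (1472 = 1500 minus 20 bytes of IP and 8 bytes of ICMP header); per-OS details are in the linked post:

  Linux:      ping -M do -s 1472 192.0.2.10
  Windows:    ping -f -l 1472 192.0.2.10
  Cisco IOS:  ping 192.0.2.10 size 1500 df-bit
  Junos:      ping 192.0.2.10 size 1472 do-not-fragment

If these fail while a plain ping works, something in the path drops or fragments large packets and the MTU hunt begins.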
-
Layer 2 issues causing loops are the second ugliest. STP frenzy, cabling misconnections, users connecting no-one-knows-what to the network. The easiest and proven way to debug here, sorry, not elegant, is to start disconnecting uplinks/switches to find the culprit and only then investigate further. A whole building going down because some smarty-pants developer brought in a docking station featuring full STP capabilities? Easily. Read in your free time the AMS-IX outage report caused by an LACP flood: https://ripe87.ripe.net/presentations/119-AMS-IX_outage_2023_v2.pdf.
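One preventive measure for the docking-station scenario, sketched for Cisco access switches (the interface is an example; check the defaults on your platform):

  ! Shut down access ports that receive BPDUs - end users should never send them
  interface GigabitEthernet0/10
   spanning-tree portfast
   spanning-tree bpduguard enable
  ! Or enable it globally for all portfast ports
  spanning-tree portfast bpduguard default
  ! When hunting an active loop, the topology change counters show where it comes from
  show spanning-tree detail | include occurred|from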
-
Everyone lies, except logs. Take every user's input with a pound of salt: "Sure, I believe you that no one touched this Fortigate; I can totally imagine that the Fortigate itself disconnected the cable from its WAN1 interface and connected it to port7, why not?". Verify, verify - logs do not lie.
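For example, interface up/down events sit in the device log (and on your syslog server) long after everyone swears they touched nothing; a Cisco-style illustration, with the Fortigate equivalent living in its event log:

  show logging | include LINK-3-UPDOWN|LINEPROTO-5-UPDOWN

The lines you are after look like "%LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down", timestamped.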
-
Never do network changes on Friday (Thursday in Israel) or before holidays. Better to be ready and prepared to solve issues the next working day than to be surprised on Monday morning. Before substantial changes, agree on a set of tests to run after the change that would prove all systems are go.
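Such a test set does not have to be fancy; a sketch of what it might contain, with placeholder commands and targets to adapt to the actual change:

  ! Control plane - all neighbors back, prefix counts roughly as before
  show ip bgp summary
  show ip ospf neighbor
  ! Data plane - key destinations reachable over the expected path
  ping 192.0.2.10
  traceroute 192.0.2.10
  ! Interfaces - expected state, no errors
  show ip interface brief
  ! And the non-CLI part: the application owners confirm their apps work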
-
Have an ocean of patience, learn Zen/meditation/exercise, as it has always been and will always be "a network/firewall issue". Even after proving this was not the case for the 10e3th time, this mantra will not go away. Once you are established in your workplace, bluntly ask for the evidence that it is 'a network issue' - what's your proof? When nothing helps, say "It is surely a DNS issue" :).
-
When working on important gear or working remotely, use the vendor's rollback safeguards. Today all vendors have them: commit confirmed <minutes> (Juniper), reload in <hhh:mm> (Cisco), cfg-save (Fortigate). For Fortigates see https://yurisk.info/2024/12/07/fortigate-revert-configuration-as-a-safety-measure-analog-to-cisco-reload-in-or-juniper-commit-confirmed/
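A minimal illustration of the workflow on Juniper and Cisco (the timers are examples):

  Juniper - the change goes live, then rolls back automatically unless confirmed in time:
    user@router# commit confirmed 10
    (verify you still have access and the change works)
    user@router# commit

  Cisco IOS - schedule a reload first and do not save yet; if you cut yourself off, the box reboots into the old startup-config:
    router# reload in 15
    (make and verify the change)
    router# reload cancel
    router# write memory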
-
Never learn/Google the same thing twice - take notes. No one knows everything, but chances are good that if you had to Google some config/command/template, you will need it again some day. I use MS OneNote for everything, with a backup to the cloud at home and a backup to a local network drive at work. You can protect notes with a password if sensitive info is kept there (256-bit AES encrypted).
-
Keep documentation for all active equipment in a location that is easy to find. Try to find Nortel guides today and you will find nothing. MRV? The same. And clients still have those museum boxes running.
-
Learn the protocols the Internet runs on - IP/TCP - and learn to interpret what you see in Wireshark. It doesn't happen often, fortunately, but when a no-idea-what-happens problem occurs, the packet capture is the ultimate source of truth. But to actually understand what you see, you need to know the protocols and know your way around Wireshark.
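A minimal capture workflow, assuming a Linux box in the path or on a SPAN/mirror port (host and port are placeholders):

  # Capture only what you care about and write it to a file for Wireshark
  tcpdump -i eth0 -w problem.pcap host 192.0.2.10 and port 443

In Wireshark, display filters such as tcp.analysis.retransmission, tcp.analysis.zero_window or dns.flags.rcode != 0 quickly separate the suspicious packets from the noise.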
-
Know your network - monitor everything. NetFlow is a life-saver: have it collected, parsed, and available for real-time analysis. In large networks, logging into every box is just not doable.
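A hedged sketch of classic NetFlow export on Cisco IOS (newer platforms use Flexible NetFlow or IPFIX and the syntax differs; the collector address and port are placeholders):

  ip flow-export version 9
  ip flow-export destination 192.0.2.50 2055
  !
  interface GigabitEthernet0/1
   ip flow ingress
   ip flow egress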
-
Back up configurations on schedule and on changes. Be it the ancient RANCID or some netbox for a million dollars, the result is the same: backups of all gear stored in a versioning system (Git today).
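Even a crude home-grown version beats nothing; a minimal sketch assuming SSH key access and a local Git repository (hostnames, the path, and the fetch command are placeholders for whatever your gear supports):

  #!/bin/sh
  # Pull the running config from each device and commit any changes to Git
  cd /opt/net-backups || exit 1
  for host in core1 core2 edge-fw1; do
      ssh backup@"$host" "show running-config" > "$host.cfg"
  done
  git add -A
  git commit -m "scheduled config backup $(date +%F)" || true  # nothing to commit = not an error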
-
Do not forget to use the "add" keyword when adding VLANs to a trunk's allowed list on Cisco - without it, you replace the whole list.
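The classic mistake and the correct version (the VLAN number and interface are examples):

  interface GigabitEthernet0/1
   ! WRONG - this replaces the whole allowed list with VLAN 30 only:
   switchport trunk allowed vlan 30
   ! RIGHT - this appends VLAN 30 to the existing list:
   switchport trunk allowed vlan add 30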
-
When estimating planned downtime - estimate, then multiply by 3.
-
I also write cheat sheets/scripts/guides to help in daily work, so make sure to check out my GitHub at https://github.com/yuriskinfo