Network Operations Manual
The multi-vendor switch monitoring checklist standardizes checks around the Core 4 pillars, covering the vulnerability, status, configuration, performance, traffic, data loss, data flow, and overall health of critical switches.
1. Critical Monitoring Checklist (The “Core 4”)
POLLING INTERVAL: 60S
| Category | Primary Metric & SNMP OID | Threshold / Trigger | Automated Remediation |
|---|---|---|---|
| Vulnerabilities | Firmware CVE & MD5 Hash (.1.3.6.1.2.1.1.1.0) | Version Mismatch | Isolate Mgmt VTY; Alert InfoSec. |
| Status | Port Flaps & Uptime (ifOperStatus) | Reset > 3 / 5min | Auto-disable port; Trigger SNMP Trap. |
| Traffic | Bandwidth & Output Drops (ifOutDiscards) | > 70% Util | Poll sFlow; Dynamic QoS adjustment. |
| Performance | CPU & RAM Load (cpmCPUTotalMonInterval) | > 80% CPU | Enable CoPP; Log TCAM table usage. |
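The Status row's trigger (more than 3 resets in 5 minutes) can be sketched as a sliding-window counter over polled ifOperStatus values. This is a minimal illustration, not a vendor implementation: the class name `FlapDetector`, its parameters, and the sample polls are hypothetical, and a 60 s poll can miss flaps that occur between samples, so trap-driven input may be preferable in practice.

```python
from collections import deque


class FlapDetector:
    """Counts ifOperStatus transitions inside a sliding time window."""

    def __init__(self, max_flaps=3, window_s=300):
        self.max_flaps = max_flaps      # checklist trigger: more than 3 resets
        self.window_s = window_s        # checklist window: 5 minutes
        self.last_state = None
        self.transitions = deque()      # timestamps of observed state changes

    def observe(self, timestamp, oper_status):
        """Record one poll result; return True when the trigger is met."""
        if self.last_state is not None and oper_status != self.last_state:
            self.transitions.append(timestamp)
        self.last_state = oper_status
        # Drop transitions that have aged out of the window.
        while self.transitions and timestamp - self.transitions[0] > self.window_s:
            self.transitions.popleft()
        return len(self.transitions) > self.max_flaps


detector = FlapDetector()
# Hypothetical 60 s polls; ifOperStatus is 1 (up) or 2 (down).
polls = [(0, 1), (60, 2), (120, 1), (180, 2), (240, 1)]
results = [detector.observe(t, s) for t, s in polls]
print(results)  # [False, False, False, False, True]
```

The fourth transition inside the window trips the trigger; the remediation itself (disabling the port, sending the trap) is left to whatever automation the site already uses.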
Stability & Optimization Protocols
Control Plane Stability
Ensure Control Plane Policing (CoPP) is active to prevent DoS attacks from impacting routing protocols. Use hardware rate-limiters for ICMP and ARP traffic.
TCAM Optimization
Regularly audit ACLs and Prefix-lists. Unused entries consume ASIC resources and slow down lookups.
Latency Jitter Analysis
Implement IP SLA probes across core links. Stability is defined by a variance of < 2ms for voice/video traffic classes.
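One way to evaluate the 2 ms criterion from probe results is sketched below. It assumes "variance" means the mean absolute difference between consecutive round-trip samples; the manual does not define the statistic, so treat that choice, the function name `mean_jitter_ms`, and the sample values as assumptions.

```python
JITTER_LIMIT_MS = 2.0  # threshold stated for voice/video classes


def mean_jitter_ms(rtt_samples_ms):
    """Mean absolute difference between consecutive RTT samples (ms)."""
    if len(rtt_samples_ms) < 2:
        raise ValueError("need at least two samples")
    deltas = [abs(b - a) for a, b in zip(rtt_samples_ms, rtt_samples_ms[1:])]
    return sum(deltas) / len(deltas)


# Hypothetical RTT samples from one probe, in milliseconds.
samples = [10.0, 11.0, 10.5, 13.5, 12.0]
jitter = mean_jitter_ms(samples)
stable = jitter < JITTER_LIMIT_MS
print(jitter, stable)  # 1.5 True
```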
SNMP Polling Optimization
Use 64-bit counters (HC) for high-speed interfaces (>1Gbps) to prevent counter rollover errors.
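The arithmetic behind this rule can be checked with a short sketch. A 32-bit octet counter on a saturated 1 Gbps link rolls over in roughly 34 seconds, which is shorter than a 60 s polling interval. The function names are illustrative, and `counter_delta` can only recover from a single rollover between two samples.

```python
def seconds_to_wrap(counter_bits, link_bps):
    """Time for an octet counter to roll over at full line rate."""
    return (2 ** counter_bits) * 8 / link_bps


def counter_delta(previous, current, counter_bits=32):
    """Difference between two samples, tolerating one rollover."""
    return (current - previous) % (2 ** counter_bits)


wrap_32 = seconds_to_wrap(32, 1_000_000_000)   # about 34.4 seconds
wrap_64 = seconds_to_wrap(64, 1_000_000_000)   # thousands of years
delta = counter_delta(4_294_967_000, 704)      # counter wrapped once
print(round(wrap_32, 1), delta)  # 34.4 1000
```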
Performance & Monitoring Checklist
Packet Errors & Drops
Track CRC errors or input drops. Persistent increments usually indicate faulty cabling, SFP failure, or duplex mismatches.
Environmental Monitoring
Track internal temperature and fan status. Sudden spikes in ambient temp can lead to localized hardware failure or ASIC throttling.
Optical Telemetry (DOM)
Monitor TX/RX power levels on fiber SFPs. Detect degradation (optical drift) before the link drops entirely.
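A simple drift check compares recent RX readings against a baseline recorded when the link was commissioned. The sketch below is illustrative only: the 2 dB margin, the function name `is_degrading`, and the readings are assumptions, and real alarm limits should come from the transceiver's own DOM thresholds.

```python
def is_degrading(baseline_dbm, readings_dbm, max_loss_db=2.0):
    """True when every recent reading sits more than max_loss_db below baseline."""
    if not readings_dbm:
        return False
    return all(baseline_dbm - reading > max_loss_db for reading in readings_dbm)


# Hypothetical RX power readings in dBm, oldest first.
baseline = -3.0
recent = [-5.5, -5.8, -6.1]
print(is_degrading(baseline, recent))  # True
```

Requiring every recent reading to breach the margin avoids alerting on a single noisy sample.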
Threshold Summary
- Bandwidth Saturation > 70%
- Control Plane CPU > 80%
- System Memory Leak Check > 90%
- TCAM Table Exhaustion > 85%
- Temp (Chassis): Vendor Specific
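The numeric thresholds above can be held in one table and evaluated per polling cycle. This is a minimal sketch: the metric key names and the sample values are hypothetical, and chassis temperature is omitted because its limit is vendor specific.

```python
THRESHOLDS = {
    "bandwidth_util_pct": 70,
    "control_plane_cpu_pct": 80,
    "memory_used_pct": 90,
    "tcam_used_pct": 85,
}


def breached(sample, thresholds=THRESHOLDS):
    """Return the sorted names of metrics whose value exceeds its limit."""
    return sorted(
        name for name, limit in thresholds.items()
        if sample.get(name, 0) > limit
    )


# Hypothetical readings from one polling cycle.
sample = {
    "bandwidth_util_pct": 72.5,
    "control_plane_cpu_pct": 41,
    "memory_used_pct": 90,      # equal to the limit, so not a breach
    "tcam_used_pct": 88,
}
alerts = breached(sample)
print(alerts)  # ['bandwidth_util_pct', 'tcam_used_pct']
```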
Multi-Vendor Performance “Meta” Data
Bandwidth (Traffic)
Monitor ifInOctets and ifOutOctets across all interfaces.
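Utilization is derived from the change in an octet counter between two polls, converted to bits and divided by what the link could have carried in that interval. The sketch below assumes the counter did not roll over between samples and that the interface speed is known; the function name and figures are illustrative.

```python
def utilization_pct(octets_prev, octets_now, interval_s, if_speed_bps):
    """Percent of link capacity used between two octet-counter samples."""
    bits_transferred = (octets_now - octets_prev) * 8
    return 100.0 * bits_transferred / (interval_s * if_speed_bps)


# Hypothetical samples: 1 Gbps link, 60 s polling interval.
util = utilization_pct(
    octets_prev=1_000_000_000,
    octets_now=6_625_000_000,
    interval_s=60,
    if_speed_bps=1_000_000_000,
)
print(util)  # 75.0
```

A result of 75.0 would exceed the 70% bandwidth threshold used elsewhere in this manual.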
Buffer Depth
Monitor Output Drops. High rates indicate congestion regardless of buffer logic (Arista vs Cisco).
Resource Health
- CPU: >80% (BGP/OSPF stress)
- Memory: >90% (Leaks/Oversized Tables)
- Storage: Bootflash fragmentation check
2. Configuration & Verification Lifecycle
| Step | Action | Key Command (Cisco/Generic) |
|---|---|---|
| 1. Verify Current | Check existing status before initiating changes. | show running-config or show vlans |
| 2. Update | Apply necessary changes (VLANs, Security). | conf t -> [commands] |
| 3. Test | Ensure the change works as intended (Data plane). | ping [target IP] or traceroute |
| 4. Verify New | Check that the running-config reflects the change. | show ip interface brief or show vlan brief |
| 5. Perm-Save | Move RAM config to NVRAM (Persistence). | write memory or copy run start |
| 6. Doc | Log the change with a timestamp in external CMDB. | External Log / Syslog / Jira |
| 7. Post-Verification | Check Neighbor Adjacencies (BGP/OSPF/CDP). | show ip bgp summary or show cdp neighbors |
3. Troubleshooting & Advanced Logging
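Standardize logging at Level 4 (Warnings) as the production baseline. Syslog severity numbers run inversely (0 is the most severe), and link-state messages such as %LINEPROTO-5-UPDOWN are severity 5 (Notifications), so the logging buffer must capture level 5 for the first command below to return results. Use the following commands for deep-dive investigation: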
show logging | include %LINEPROTO-5-UPDOWN
monitor capture mycap start
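When the filtered log output is exported for offline review, transitions can be tallied per interface. The sketch below assumes the usual Cisco IOS message wording ("Line protocol on Interface X, changed state to up/down"); the sample lines and the function name `count_transitions` are hypothetical.

```python
import re
from collections import Counter

UPDOWN = re.compile(
    r"%LINEPROTO-5-UPDOWN: Line protocol on Interface ([^,\s]+), "
    r"changed state to (up|down)"
)


def count_transitions(log_lines):
    """Count line-protocol state changes per interface."""
    counts = Counter()
    for line in log_lines:
        match = UPDOWN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log excerpt.
log = [
    "*Mar  1 00:01:10.123: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/1, changed state to down",
    "*Mar  1 00:01:14.456: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/1, changed state to up",
    "*Mar  1 00:02:00.000: %SYS-5-CONFIG_I: Configured from console by admin on vty0",
    "*Mar  1 00:03:30.789: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/2, changed state to down",
]
flaps = count_transitions(log)
print(dict(flaps))  # {'GigabitEthernet0/1': 2, 'GigabitEthernet0/2': 1}
```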
Stability Alert: Debug Overload
Never run debug all in a production environment. Use specific ACL-filtered debugs to prevent CPU exhaustion.
4. High-Speed Discovery Workflow
SNMP v3 / LLDP Auto-Discovery / NetBox Integration
NetFlow / sFlow “Top Talkers” / Traffic Classification
Periodic show version CVE audit / Port-Security
Persistence sync / Remote Config Backup (Git/TFTP)