Thursday, June 25, 2009

Using SNMP with a Cisco router to troubleshoot & diagnose ADSL problems

We've been having numerous errors and problem with an ADSL line at the company where I work. We're trying to run an Ipsec tunnel across the ADSL line to connect to one of our small branch offices. Every few hours we would have problems, with the tunnel dropping out. Restarting the router would often fix the problem, but the problems never went away for long.

We were using good quality Cisco 1801 routers. The most useful command for debugging is 'show dsl interface atm0' and 'show int atm0'. Below is some typical output, while the ADSL line was faulty:

#show int atm 0
ATM0 is up, line protocol is up
Hardware is MPC ATMSAR (with Alcatel ADSL Module)
MTU 4470 bytes, sub MTU 4470, BW 832 Kbit/sec, DLY 440 usec,
reliability 255/255, txload 41/255, rxload 6/255
Encapsulation ATM, loopback not set
Encapsulation(s): AAL5 AAL2, PVC mode
23 maximum active VCs, 1024 VCs per VP, 1 current VCCs
VC Auto Creation Disabled.
VC idle disconnect time: 300 seconds
Last input never, output 00:00:00, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 48
Queueing strategy: Per VC Queueing
5 minute input rate 22000 bits/sec, 20 packets/sec
5 minute output rate 137000 bits/sec, 24 packets/sec
322340 packets input, 50047091 bytes, 0 no buffer
Received 0 broadcasts, 0 runts, 0 giants, 0 throttles
0 input errors, 182 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
345336 packets output, 145468959 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 output buffer failures, 0 output buffers swapped out

#show dsl interface atm0
ATM0
Alcatel 20190 chipset information
ATU-R (DS) ATU-C (US)
Modem Status: Showtime (DMTDSL_SHOWTIME)
DSL Mode: ITU G.992.1 (G.DMT) Annex A
ITU STD NUM: 0x03 0x2
Vendor ID: 'STMI' 'P '
Vendor Specific: 0x0000 0x0000
Vendor Country: 0x0F 0xB5
Chip ID: C196 (0)
DFE BOM: DFE3.0 Annex A (1)
Capacity Used: 100% 85%
Noise Margin: 1.0 dB 11.0 dB
Output Power: 20.0 dBm 12.5 dBm
Attenuation: 34.0 dB 17.0 dB
Defect Status: None None
Last Fail Code: None
Watchdog Counter: 0xA7
Watchdog Resets: 0
Selftest Result: 0x00
Subfunction: 0x00
Interrupts: 4155 (0 spurious)
PHY Access Err: 0
Activations: 3
LED Status: ON
LED On Time: 100
LED Off Time: 100
Init FW: init_AMR-3.0.014_no_bist.bin
Operation FW: AMR-3.0.014.bin
FW Source: embedded
FW Version: 3.0.14

Interleave Fast Interleave Fast
Speed (kbps): 4640 0 832 0
Cells: 716617 0 108647003 0
Reed-Solomon EC: 43244 0 9 9
CRC Errors: 25362 0 22 14
Header Errors: 15617 0 9 7
Total BER: 1610E-8 0E-0
Leakage Average BER: 2734E-8 0E-0
ATU-R (DS) ATU-C (US)
Bitswap: enabled enabled
Bitswap success: 0 0
Bitswap failure: 0 0

You can see from the above that the 'Down Stream' (DS) noise margin is very low at 1dB. We have also seen 0.5dB reported, or even 0db! You can also see the error counts are high. In the output of 'show int atm 0', the figure to monitor is the number of CRC errors - in the output above it's 182. It is also useful to set 'debug atm errors' on the cisco. The in the syslog's you will see messages like:

*Jun 19 10:22:01.894: ATM0: AAL5 rx errors (status = 0C030000)
*Jun 19 10:22:01.974: ATM0: AAL5 rx errors (status = 2C030000)
*Jun 19 10:22:02.758: ATM0: AAL5 rx errors (status = 0C030000)
*Jun 19 10:22:04.058: ATM0: AAL5 rx errors (status = 2C070000)
*Jun 19 10:22:04.922: ATM0: AAL5 rx errors (status = 0C030000)
*Jun 19 10:22:06.174: ATM0: AAL5 rx errors (status = 0C030000)
*Jun 19 10:22:07.822: ATM0: AAL5 rx errors (status = 0C030000)

To try and dig deeper, I used SNMP (from a Linux box on the same LAN), to query the router:

[root@hexdns1 snmp_queries]# snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94
SNMPv2-SMI::transmission.94.1.1.1.1.1.19 = INTEGER: 2
SNMPv2-SMI::transmission.94.1.1.1.1.2.19 = INTEGER: 3
SNMPv2-SMI::transmission.94.1.1.2.1.2.19 = STRING: "P"
SNMPv2-SMI::transmission.94.1.1.2.1.3.19 = STRING: "0"
SNMPv2-SMI::transmission.94.1.1.2.1.4.19 = Gauge32: 110
SNMPv2-SMI::transmission.94.1.1.2.1.5.19 = Gauge32: 175
SNMPv2-SMI::transmission.94.1.1.2.1.7.19 = Gauge32: 200
SNMPv2-SMI::transmission.94.1.1.2.1.8.19 = Gauge32: 4492000
SNMPv2-SMI::transmission.94.1.1.3.1.1.19 = ""
SNMPv2-SMI::transmission.94.1.1.3.1.2.19 = STRING: "STMI"
SNMPv2-SMI::transmission.94.1.1.3.1.3.19 = STRING: "0"
SNMPv2-SMI::transmission.94.1.1.3.1.4.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.3.1.5.19 = Gauge32: 340
SNMPv2-SMI::transmission.94.1.1.3.1.7.19 = Gauge32: 130
SNMPv2-SMI::transmission.94.1.1.3.1.8.19 = Gauge32: 967000
SNMPv2-SMI::transmission.94.1.1.4.1.2.19 = Gauge32: 4448000
SNMPv2-SMI::transmission.94.1.1.5.1.2.19 = Gauge32: 832000
SNMPv2-SMI::transmission.94.1.1.6.1.1.19 = Counter32: 697
SNMPv2-SMI::transmission.94.1.1.6.1.2.19 = Counter32: 449
SNMPv2-SMI::transmission.94.1.1.6.1.4.19 = Counter32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.5.19 = Counter32: 1820
SNMPv2-SMI::transmission.94.1.1.6.1.6.19 = Counter32: 31
SNMPv2-SMI::transmission.94.1.1.6.1.10.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.11.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.13.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.14.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.15.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.17.19 = Gauge32: 565
SNMPv2-SMI::transmission.94.1.1.6.1.18.19 = Gauge32: 381
SNMPv2-SMI::transmission.94.1.1.6.1.20.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.21.19 = Gauge32: 1034
SNMPv2-SMI::transmission.94.1.1.6.1.22.19 = Gauge32: 10
SNMPv2-SMI::transmission.94.1.1.7.1.1.19 = Counter32: 45
SNMPv2-SMI::transmission.94.1.1.7.1.2.19 = Counter32: 28
SNMPv2-SMI::transmission.94.1.1.7.1.3.19 = Counter32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.4.19 = Counter32: 15043
SNMPv2-SMI::transmission.94.1.1.7.1.8.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.9.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.10.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.11.19 = Gauge32: 328
SNMPv2-SMI::transmission.94.1.1.7.1.13.19 = Gauge32: 44
SNMPv2-SMI::transmission.94.1.1.7.1.14.19 = Gauge32: 27
SNMPv2-SMI::transmission.94.1.1.7.1.15.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.16.19 = Gauge32: 3717
SNMPv2-SMI::transmission.94.1.1.10.1.3.19 = Counter32: 1864592813
SNMPv2-SMI::transmission.94.1.1.10.1.4.19 = Counter32: 804238
SNMPv2-SMI::transmission.94.1.1.10.1.10.19 = Gauge32: 2538597
SNMPv2-SMI::transmission.94.1.1.10.1.11.19 = Gauge32: 11493
SNMPv2-SMI::transmission.94.1.1.10.1.15.19 = Gauge32: 804265846
SNMPv2-SMI::transmission.94.1.1.10.1.16.19 = Gauge32: 400641
SNMPv2-SMI::transmission.94.1.1.11.1.3.19 = Counter32: 31565
SNMPv2-SMI::transmission.94.1.1.11.1.4.19 = Counter32: 56466
SNMPv2-SMI::transmission.94.1.1.11.1.10.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.11.1.11.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.11.1.15.19 = Gauge32: 18799
SNMPv2-SMI::transmission.94.1.1.11.1.16.19 = Gauge32: 44213

If you have the ADSL-LINE-MIB in place, these OID's can be translated:

# snmptranslate 1.3.6.1.2.1.10.94
SNMPv2-SMI::transmission.94
# export MIBS=ALL
# snmptranslate 1.3.6.1.2.1.10.94
ADSL-LINE-MIB::adslMIB

For a guide on how to use Linux SNMP, see my snmp tutorial.

# snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94
ADSL-LINE-MIB::adslLineCoding.19 = INTEGER: 2
ADSL-LINE-MIB::adslLineType.19 = INTEGER: interleavedOnly(3)
ADSL-LINE-MIB::adslAtucInvVendorID.19 = STRING: P
ADSL-LINE-MIB::adslAtucInvVersionNumber.19 = STRING: 0
ADSL-LINE-MIB::adslAtucCurrSnrMgn.19 = Wrong Type (should be INTEGER): Gauge32: 110
ADSL-LINE-MIB::adslAtucCurrAtn.19 = Gauge32: 170 tenth dB
ADSL-LINE-MIB::adslAtucCurrOutputPwr.19 = Wrong Type (should be INTEGER): Gauge32: 200
ADSL-LINE-MIB::adslAtucCurrAttainableRate.19 = Gauge32: 4686000 bps
ADSL-LINE-MIB::adslAturInvSerialNumber.19 = STRING:
ADSL-LINE-MIB::adslAturInvVendorID.19 = STRING: STMI
ADSL-LINE-MIB::adslAturInvVersionNumber.19 = STRING: 0
ADSL-LINE-MIB::adslAturCurrSnrMgn.19 = Wrong Type (should be INTEGER): Gauge32: 15
ADSL-LINE-MIB::adslAturCurrAtn.19 = Gauge32: 340 tenth dB
ADSL-LINE-MIB::adslAturCurrOutputPwr.19 = Wrong Type (should be INTEGER): Gauge32: 130
ADSL-LINE-MIB::adslAturCurrAttainableRate.19 = Gauge32: 978000 bps
ADSL-LINE-MIB::adslAtucChanCurrTxRate.19 = Gauge32: 4640000 bps
ADSL-LINE-MIB::adslAturChanCurrTxRate.19 = Gauge32: 832000 bps
ADSL-LINE-MIB::adslAtucPerfLofs.19 = Counter32: 453
ADSL-LINE-MIB::adslAtucPerfLoss.19 = Counter32: 13
ADSL-LINE-MIB::adslAtucPerfLprs.19 = Counter32: 0
ADSL-LINE-MIB::adslAtucPerfESs.19 = Counter32: 1997
ADSL-LINE-MIB::adslAtucPerfInits.19 = Counter32: 18
ADSL-LINE-MIB::adslAtucPerfCurr15MinLofs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr15MinLoss.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr15MinLprs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr15MinESs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr15MinInits.19 = Gauge32: 0
ADSL-LINE-MIB::adslAtucPerfCurr1DayLofs.19 = Gauge32: 445 seconds
ADSL-LINE-MIB::adslAtucPerfCurr1DayLoss.19 = Gauge32: 14 seconds
ADSL-LINE-MIB::adslAtucPerfCurr1DayLprs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr1DayESs.19 = Gauge32: 1425 seconds
ADSL-LINE-MIB::adslAtucPerfCurr1DayInits.19 = Gauge32: 9
ADSL-LINE-MIB::adslAturPerfLofs.19 = Counter32: 22 seconds
ADSL-LINE-MIB::adslAturPerfLoss.19 = Counter32: 15 seconds
ADSL-LINE-MIB::adslAturPerfLprs.19 = Counter32: 0 seconds
ADSL-LINE-MIB::adslAturPerfESs.19 = Counter32: 95908 seconds
ADSL-LINE-MIB::adslAturPerfCurr15MinLofs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAturPerfCurr15MinLoss.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAturPerfCurr15MinLprs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAturPerfCurr15MinESs.19 = Gauge32: 345 seconds
ADSL-LINE-MIB::adslAturPerfCurr1DayLofs.19 = Gauge32: 22 seconds
ADSL-LINE-MIB::adslAturPerfCurr1DayLoss.19 = Gauge32: 15 seconds
ADSL-LINE-MIB::adslAturPerfCurr1DayLprs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAturPerfCurr1DayESs.19 = Gauge32: 21932 seconds
ADSL-LINE-MIB::adslAtucChanCorrectedBlks.19 = Counter32: 4045523380
ADSL-LINE-MIB::adslAtucChanUncorrectBlks.19 = Counter32: 23129426
ADSL-LINE-MIB::adslAtucChanPerfCurr15MinCorrectedBlks.19 = Gauge32: 6278668
ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks.19 = Gauge32: 17489
ADSL-LINE-MIB::adslAtucChanPerfCurr1DayCorrectedBlks.19 = Gauge32: 1783723602
ADSL-LINE-MIB::adslAtucChanPerfCurr1DayUncorrectBlks.19 = Gauge32: 10061120
ADSL-LINE-MIB::adslAturChanCorrectedBlks.19 = Counter32: 4235
ADSL-LINE-MIB::adslAturChanUncorrectBlks.19 = Counter32: 9371
ADSL-LINE-MIB::adslAturChanPerfCurr15MinCorrectedBlks.19 = Gauge32: 0
ADSL-LINE-MIB::adslAturChanPerfCurr15MinUncorrectBlks.19 = Gauge32: 0
ADSL-LINE-MIB::adslAturChanPerfCurr1DayCorrectedBlks.19 = Gauge32: 1538
ADSL-LINE-MIB::adslAturChanPerfCurr1DayUncorrectBlks.19 = Gauge32: 5374

Many of the figure in the output of 'show dsl interface atm0' are available via SNMP. For example 'Down Stream Noise Margin' is 'adslAturCurrSnrMgn' quoted in tenths of a decibel. So a reading of 15 means 1.5dB. To get an idea of how these values were changing with time, I used Cacti and RRD to poll the SNMP and produce some nice graphs. I Googles and found a Cacti 'ADSL Line Data' templates that allows you to graph Attenuation, Noise margin, and Output Power. Of those, only Noise Margin is interesting, because I found the Attenuation and Output Power to remain constant. But the thing the template did not graph was the number of uncorrectable errors, labelled 'adslAtucChanUncorrectBlks' in the MIB, and this is what I was really interested in, as when we were having problems with the ADSL line, the count of uncorrectable errors was high.

So I modified '/var/www/cacti/resource/snmp_queries/adslLine.xml' to include another OID. I chose 'ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks' which the MIB decribes as follows:

# snmptranslate -Td ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks
ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks
adslAtucChanPerfCurr15MinUncorrectBlks OBJECT-TYPE
-- FROM ADSL-LINE-MIB
MAX-ACCESS read-only
STATUS current
DESCRIPTION "Count of all blocks received with uncorrectable
errors on this channel within the current 15 minute
interval."

I modified the XML for the data query as follows:

# snmptranslate -On ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks
.1.3.6.1.2.1.10.94.1.1.10.1.11

# diff -u adslLine.orig.xml adslLine.xml
--- adslLine.orig.xml 2009-05-01 16:57:18.000000000 +0100
+++ adslLine.xml 2009-06-15 18:53:58.000000000 +0100
@@ -27,6 +27,13 @@
<direction>input</direction>
<oid>.1.3.6.1.2.1.10.94.1.1.3.1.2</oid>
</adslAtucInvVendorID>
+ <adslAtuc15mUncrBlks>
+ <name>Local Curr 15min Uncorrectable Blocks</name>
+ <method>walk</method>
+ <source>value</source>
+ <direction>output</direction>
+ <oid>.1.3.6.1.2.1.10.94.1.1.10.1.11</oid>
+ </adslAtuc15mUncrBlks>
<adslAtucCurrSnrMgn>
<name>Local Noise Margin (tenth dB)</name>
<method>walk</method>

When I started to graph this value, I was surprised by the range of values I saw. Often it was low, a few hundred, but some time it would shoot up to 60,000! I realised that to graph this range of values it would be better to use a logarithmic scale.

After graphing uncorrectable errors for a while I began to question what does 'within the current 15 minute interval.' actually mean. So let's look in detail at the OID's related to uncorrectable errors. Here are some (high) starting figures:

# snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94.1.1.10.1
SNMPv2-SMI::transmission.94.1.1.10.1.3.19 = Counter32: 293561906
SNMPv2-SMI::transmission.94.1.1.10.1.4.19 = Counter32: 1493720
SNMPv2-SMI::transmission.94.1.1.10.1.10.19 = Gauge32: 6671894
SNMPv2-SMI::transmission.94.1.1.10.1.11.19 = Gauge32: 80028
SNMPv2-SMI::transmission.94.1.1.10.1.15.19 = Gauge32: 293561906
SNMPv2-SMI::transmission.94.1.1.10.1.16.19 = Gauge32: 1493720

You can see that "adslAtucChanCorrectedBlks" and "adslAtucChanPerfCurr1DayUncorrectBlks" are identical values, and they were increasing exactly in step, so they are both SNMP 'Counters'. A few minutes latter, we get these reading:

# snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94.1.1.10.1
SNMPv2-SMI::transmission.94.1.1.10.1.3.19 = Counter32: 293628501
SNMPv2-SMI::transmission.94.1.1.10.1.4.19 = Counter32: 1495372
SNMPv2-SMI::transmission.94.1.1.10.1.10.19 = Gauge32: 27251
SNMPv2-SMI::transmission.94.1.1.10.1.11.19 = Gauge32: 1383
SNMPv2-SMI::transmission.94.1.1.10.1.15.19 = Gauge32: 293628501
SNMPv2-SMI::transmission.94.1.1.10.1.16.19 = Gauge32: 1495372

In the period between taking these readings, the '15Min' figures, have reset to zero, and are counting up again. If you keep a constant watch on these figures, you can see that every 15 minutes they reset back to zero. I found the following command to be useful to monitor errors occuring in real time.

# watch -n 5 'snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94.1.1.10.1'

So with the benefit of hindsight, I may have been better using Cacti to graph 'adslAtucChanUncorrectBlks' as a 'COUNTER', rather than 'adslAtucChanPerfCurr15MinUncorrectBlks' as a 'GAUGE'.

Eventually we got BT to come on site to check the ADSL line, and they located
a bad connection. Since they fixed that, the noise margin has remained consistently above 5dB and the uncorrected blocks has been very low.

Here are some graph, where you can that the ADSL line has been fixed:

1 comment:

Juan Cabrio said...

I'm curious to what adslAtucChanPerfCurr15MinUncorrectBlks actually represents as on a Cisco 877 I see CRC errors incrementing much faster than adslAtucChanPerfCurr15MinUncorrectBlk ?