Today I've double checked the QoS situation with Live Meeting, with the help of my colleague Michael Melling, and here are our conclusions:
DSCP QosEnabled works fine with Office Communication 2007 R2 on Windows XP, Vista and Windows 7, provided you use the latest updated 3.5.6907.37 version. You can use WireShark to confirm the correct DSCP values are being used, as per my earlier blog post. The audio packets and the video packets will have a different DSCP value, so you can give them a appropriate priority over your network.
Now what about Office Live Meeting? We are using the 8.0.6362.128 version, which AFAIK is the latest. If you are running Window XP, then no problem. If you are trying to use Vista or Window 7, then there is a problem, as with those operating systems, Live Meeting fails to set a DCSP code on the packets.
In our tests today, we started a video call in Office Communicator, then switched into 'Share Information Using Live Meeting' mode. This opens the Live Meeting window, but leaves audio & video going through Office Communicator. But as we are in conference mode, all A/V packets are
now going via the OCS server. At this stage DSCP is still working fine. (We also have DCSP enabled on our OCS 2007 R2 server).
Now by clicking the WebCam icon in Live Meeting, you can flip the Video into a separate windows called 'Voice & Video'. (Don't forget you also need to click the 'Join Audio' button). Now Live Meeting is handling the Video, and that's fine if you are using Windows XP. But if you use Vista or Windows 7, a check with WireShark will show none of the UDP packets being sent by Live Meeting have a DSCP value - they are all marked 0x00.
To me that looks like a bug in Live Meeting.
There is a partial work around. Using Group Policy on Vista and Window 7, you can force all packets generated by a particular application to use a DSCP value that you specify. Unfortunately this does not allow you to differentiate between Audio and Video.
Here are a couple of screen-shots were I setup a local group policy on a Vista PC, so that any UDP packets sent by Live Meeting are given a DSCP value of 56'd. I used that value just while testing, to differentiate then from the A/V packets from Office Communicator.
Friday, August 21, 2009
Wednesday, July 29, 2009
Microsoft Office Communicator 2007 R2 and DSCP QoS
Anyone trying to get DSCP working with Office Communicator 2007 R2 on Windows XP needs to grab the 'July' update (v3.5.6907.37) otherwise you will be having problems getting video to work. Here are the links:
(BTW, I'd like to thank Michael Melling for spotting those links, and for helping with the testing that I describe below.)
Before the update, with Office Communicator versions v3.5.6907.0 or v3.5.6907.34, on Windows XP, after setting QoSEnabled to 1, you find audio continues to work fine, but video fails. Most of the time we just got a black video windows, sometimes it was a grey window, and sometimes a frozen image.
Using OCS's "Monitoring Server Reports" you can see whats going wrong. Click the "User Call List" link, and then filter by one of the users in the call, and this will give you a detailed statistical report of all finished calls. For the Video Stream you will see a "Packet loss Rate" of about 60% !
Office Communicator, with QoSEnabled, works fine on Vista and Windows 7, without this 'July' update. I think that Microsoft must have done the majority of their testing with their latest operating systems, and have only recently got around to testing with Windows XP.
We have been battling this problem for a couple of weeks, so this fix comes just in time! In trying to investigate what was goingwrong, we have learned a great deal, so not all the time was wasted, just some!
DSCP QoS is enabled in the registry, like this:
.. as per these instructions:
BTW, you need to exit (not just close) Office Communicator and restart it, for this registry setting to take effect.
Once Office Communicator has setup the call using SIP (over TCP), using Office Communications Server 2007 R2, all audio & video network traffic is sent directly between the two clients using RTP (Real Time Protocol) packets in UDP packets.
Using WireShark, we could see the packets being tagged correctly:
BTW, these are the default values. You can change them using a group policy.
In WireShark, to just grab the audio & video traffic, use a capture filter like this:
This will match any UDP packets that have a non-zero DSCP field in the IP header.You can use the following display filters to select either Audio or Video packets:
BTW, if you use tcpdump to capture the packets, remember that the DSCP values are multiplied by four. So use these capture filters with tcpdump:
It is very useful to be able to see the RTP headers in the packets. To do this in WireShark, right-click on any of the packets of a UDP stream, and select 'Decode As..', the select 'Transport' as RTP.
Once you have WireShark decoding the RTP header, you can see a couple of interesting fields in the header. The first is 'Payload type',and this is set as follows:
The second interesting RTP header field is the 'Sequence number'. This is a 16-bit counter field, that increments by 1 for each packet sent in a particular stream, and is useful for detecting dropped or duplicated packets. (The sequence number seems to start at a random value.)
There is a useful feature in WireShark which will do 'Stream Anaysis..'. On WireShark v1.2.0 you will find this under the 'Telephony > RTP' menu. This will highlight where the sequence number goes wrong.
Ok, so why was the video failing, (before the 'July' update for Office Communication). We had two laptops, running Office Communicator, connected together via a HP Procurve switch,and enabled port mirroring on the switch, to a third laptop to see all network traffic. Running WireShark on the client sending video, we saw a lot more video packet, then were being received at the other end, on the receiving client. The third monitoring laptop confirmed that packets were being dropped before they emerged 'onto the wire' from the sending laptop. Looking closely at the WireShark capture from the laptop sending video, and using the 'Stream Analysis..' we saw many video packets with duplicate RTP sequence numbers! Correlating with the capture on the receiving client, we saw that any packet that had been duplicated was missing. It was not a case of just the duplicated packet being missing, but also the original packet. Thus from the point of view of the receiving packet, there was a massive level of packet loss, hence it could not display the video stream.
BTW, on Windows XP, you can use 'tcmon.exe' from the resource kit to see what QoS policies are in effect, and this confirms packet loss onthe video stream. Here are some screen shots:
Unfortunately tcmon does not run on Vista or Windows 7, which is very disapointing. Let's hope that Microsoft get around to fixing this!
BTW, if you are using Linux to priorities the audio & video traffic as it is sent into the WAN, like we are, you will find the following iptables & 'tc filters' useful to match the packets:
Best Regards
Nigel Smith
"Description of the update for Communicator 2007 R2: July 2009"
http://support.microsoft.com/kb/969695/
"Video frames are not displayed in Office Communicator 2007 R2 on a Windows XP-based computer"
http://support.microsoft.com/kb/971846/
(BTW, I'd like to thank Michael Melling for spotting those links, and for helping with the testing that I describe below.)
Before the update, with Office Communicator versions v3.5.6907.0 or v3.5.6907.34, on Windows XP, after setting QoSEnabled to 1, you find audio continues to work fine, but video fails. Most of the time we just got a black video windows, sometimes it was a grey window, and sometimes a frozen image.
Using OCS's "Monitoring Server Reports" you can see whats going wrong. Click the "User Call List" link, and then filter by one of the users in the call, and this will give you a detailed statistical report of all finished calls. For the Video Stream you will see a "Packet loss Rate" of about 60% !
Office Communicator, with QoSEnabled, works fine on Vista and Windows 7, without this 'July' update. I think that Microsoft must have done the majority of their testing with their latest operating systems, and have only recently got around to testing with Windows XP.
We have been battling this problem for a couple of weeks, so this fix comes just in time! In trying to investigate what was goingwrong, we have learned a great deal, so not all the time was wasted, just some!
DSCP QoS is enabled in the registry, like this:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\RTC\Transport
"QoSEnabled"=dword:00000001
.. as per these instructions:
"Enabling DSCP Marking"
http://technet.microsoft.com/en-us/library/dd441192(office.13).aspx
BTW, you need to exit (not just close) Office Communicator and restart it, for this registry setting to take effect.
Once Office Communicator has setup the call using SIP (over TCP), using Office Communications Server 2007 R2, all audio & video network traffic is sent directly between the two clients using RTP (Real Time Protocol) packets in UDP packets.
Using WireShark, we could see the packets being tagged correctly:
Audio - DSCP 0x28: Class Selector CS5
Video - DSCP 0x18: Class Selector CS3
BTW, these are the default values. You can change them using a group policy.
In WireShark, to just grab the audio & video traffic, use a capture filter like this:
udp and ip[1]!=0
This will match any UDP packets that have a non-zero DSCP field in the IP header.You can use the following display filters to select either Audio or Video packets:
For Audio, use:
ip.dsfield.dscp == 40
For Video, use:
ip.dsfield.dscp == 24
BTW, if you use tcpdump to capture the packets, remember that the DSCP values are multiplied by four. So use these capture filters with tcpdump:
For Audio:
ip[1]=0xA0
ip[1]=160
For Video:
ip[1]=0x60
ip[1]=96
It is very useful to be able to see the RTP headers in the packets. To do this in WireShark, right-click on any of the packets of a UDP stream, and select 'Decode As..', the select 'Transport' as RTP.
Once you have WireShark decoding the RTP header, you can see a couple of interesting fields in the header. The first is 'Payload type',and this is set as follows:
Type 97 - RT-Audio encoded redundant data
Type 111 - 'Siren' audio - used by 'LiveMeeting'
Type 114 - RTaudio (x-msrta)
Type 118 - 'Comfort Noise'
Type 121 - RTvideo (x-rtvc1)
The second interesting RTP header field is the 'Sequence number'. This is a 16-bit counter field, that increments by 1 for each packet sent in a particular stream, and is useful for detecting dropped or duplicated packets. (The sequence number seems to start at a random value.)
There is a useful feature in WireShark which will do 'Stream Anaysis..'. On WireShark v1.2.0 you will find this under the 'Telephony > RTP' menu. This will highlight where the sequence number goes wrong.
Ok, so why was the video failing, (before the 'July' update for Office Communication). We had two laptops, running Office Communicator, connected together via a HP Procurve switch,and enabled port mirroring on the switch, to a third laptop to see all network traffic. Running WireShark on the client sending video, we saw a lot more video packet, then were being received at the other end, on the receiving client. The third monitoring laptop confirmed that packets were being dropped before they emerged 'onto the wire' from the sending laptop. Looking closely at the WireShark capture from the laptop sending video, and using the 'Stream Analysis..' we saw many video packets with duplicate RTP sequence numbers! Correlating with the capture on the receiving client, we saw that any packet that had been duplicated was missing. It was not a case of just the duplicated packet being missing, but also the original packet. Thus from the point of view of the receiving packet, there was a massive level of packet loss, hence it could not display the video stream.
BTW, on Windows XP, you can use 'tcmon.exe' from the resource kit to see what QoS policies are in effect, and this confirms packet loss onthe video stream. Here are some screen shots:
Unfortunately tcmon does not run on Vista or Windows 7, which is very disapointing. Let's hope that Microsoft get around to fixing this!
BTW, if you are using Linux to priorities the audio & video traffic as it is sent into the WAN, like we are, you will find the following iptables & 'tc filters' useful to match the packets:
# Audio:
iptables -t mangle -A POSTROUTING -o eth0 -p udp -m dscp --dscp 40 -j mark1
iptables -t mangle -A POSTROUTING -o eth0 -p udp -m dscp --dscp-class CS5 -j mark1
tc filter add dev eth0 parent 10:0 prio 1 protocol ip u32 match ip tos 0xA0 0xff flowid 10:200
# Video:
iptables -t mangle -A POSTROUTING -o eth0 -p udp -m dscp --dscp 24 -j mark2
iptables -t mangle -A POSTROUTING -o eth0 -p udp -m dscp --dscp-class CS3 -j mark2
tc filter add dev eth0 parent 10:0 prio 1 protocol ip u32 match ip tos 0x60 0xff flowid 10:210
Best Regards
Nigel Smith
OpenSolaris 2009-06 and the Comstar iScsi Target
Having upgraded from OpenSolaris 2008-11 to 2009-06, I wanted to try the new Comstar iscsi target, to see how well it worked, and to see if it was ready to replace the 'old' iscsi target. And I wanted to see how far I could get with the package versions in the 'production-quality' release repository, based on snv_111. To start with, I just wanted to get something basic going as quickly as possible.
The first tasks was to disable the old iscsi target (to avoid a conflict with the new target which would also want to listen on TCP port 3260), and to enable the 'SCSI Target Mode Framework' - stmf.
Next I created a new small zfs volume.
(My zfs pool is named 'rz2pool' - it's Raidz2 - dual parity.)
And then created a 'logical unit', using the zvol as it's backing store.
Now using the GUID of the logical unit, we make it visible to any initiators, by adding a 'view'
Next we need to do the iscsi part, for which we need to install the iscsi target and the 'itadm' command. These were not already installed so I had to add the 'SUNWiscsit' package.
On the next step I had a small problem.
(BTW, I am not using NWAM to configure my network card.)
Rebooting solved this problem.
So now we create the iscsi target and check it looks ok:
I was then able to discover and logon to this iscsi target from a Windows 2003 server, using the Microsoft iscsi initiator.
The above is just the simplest possible configuration, just to allow some simple tests. For a more practical setup, you may well want multiple logical units, and want to put some restrictions on the views. But details on how to do that will have to wait for a follow-up post to this blog!
I did notice a couple of things fairly quickly. I saw the connection between the initiator and target was dropping every 50 seconds, followed immediately by a re-logon. Checking with WireShark showed that it was the target starting this with a [FIN, ACK]. With the old iscsi target I would have used DTrace to confirm this, but the iscsi target provider for the new iscsi target was not available.
I looked at the source code for the comstar iscsi target:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/comstar/port/iscsit/iscsit.c
.. and from the history link, I could see that a couple of likely fixes had already been committed:
Now normally, I would hyper-link those bug number for you. But in this case there is no point, because Sun has chosen not to allows details of those bugs to be publicly available. (Or any of the other comstar iscsi target bugs AFAIK.)
Having those fixes is important for me, so it looks like I will have to update my OpenSolaris 2009-06 using the development repository to snv_118.
The first tasks was to disable the old iscsi target (to avoid a conflict with the new target which would also want to listen on TCP port 3260), and to enable the 'SCSI Target Mode Framework' - stmf.
# svcs iscsitgt
online Jul_09 svc:/system/iscsitgt:default
# svcs stmf
disabled Jul_09 svc:/system/stmf:default
# svcadm disable iscsitgt
# svcs iscsitgt
STATE STIME FMRI
disabled 21:29:57 svc:/system/iscsitgt:default
# svcadm enable stmf
# svcs stmf
STATE STIME FMRI
online 21:30:55 svc:/system/stmf:default
# stmfadm list-state
Operational Status: online
Config Status : initialized
Next I created a new small zfs volume.
(My zfs pool is named 'rz2pool' - it's Raidz2 - dual parity.)
# zfs create -V 4g rz2pool/iscsi_lun1
And then created a 'logical unit', using the zvol as it's backing store.
# sbdadm create-lu /dev/zvol/rdsk/rz2pool/iscsi_lun1
Created the following LU:
GUID DATA SIZE SOURCE
-------------------------------- ------------------- ----------------
600144f00008278f04694a5cf0980001 4294901760 /dev/zvol/rdsk/rz2pool/iscsi_lun1
# stmfadm list-lu -v
LU Name: 600144F00008278F04694A5CF0980001
Operational Status: Online
Provider Name : sbd
Alias : /dev/zvol/rdsk/rz2pool/iscsi_lun1
View Entry Count : 0
Now using the GUID of the logical unit, we make it visible to any initiators, by adding a 'view'
# stmfadm add-view 600144f00008278f04694a5cf0980001
# stmfadm list-view -l 600144f00008278f04694a5cf0980001
View Entry: 0
Host group : All
Target group : All
LUN : 0
Next we need to do the iscsi part, for which we need to install the iscsi target and the 'itadm' command. These were not already installed so I had to add the 'SUNWiscsit' package.
# pkg install -v SUNWiscsit
Creating Plan | Before evaluation:
UNEVALUATED:
+pkg:/SUNWiscsit@0.5.11,5.11-0.111:20090508T161047Z
After evaluation:
None -> pkg:/SUNWiscsit@0.5.11,5.11-0.111:20090508T161047Z
None -> pkg:/SUNWiscsidm@0.5.11,5.11-0.111:20090508T161041Z
Actuators:
restart_fmri: svc:/system/manifest-import:default
None
DOWNLOAD PKGS FILES XFER (MB)
Completed 2/2 23/23 0.66/0.66
PHASE ACTIONS
Install Phase 76/76
PHASE ITEMS
Reading Existing Index 8/8
Indexing Packages 2/2
# pkg list -v SUNWiscsit
FMRI STATE UFIX
pkg:/SUNWiscsit@0.5.11,5.11-0.111:20090508T161047Z installed ----
# pkg list -v SUNWiscsidm
FMRI STATE UFIX
pkg:/SUNWiscsidm@0.5.11,5.11-0.111:20090508T161041Z installed ----
On the next step I had a small problem.
(BTW, I am not using NWAM to configure my network card.)
# svcadm enable -r iscsi/target:default
svcadm: svc:/milestone/network depends on svc:/network/physical, which has multiple instances.
# svcs -a | grep network/physical
disabled Jul_09 svc:/network/physical:nwam
online Jul_09 svc:/network/physical:default
# svcadm enable iscsi/target
# svcs -a | grep iscsi
disabled Jul_09 svc:/network/iscsi_initiator:default
disabled 21:29:57 svc:/system/iscsitgt:default
maintenance 22:29:13 svc:/network/iscsi/target:default
# svcs -xv
svc:/network/iscsi/target:default (iscsi target)
State: maintenance since Tue Jul 14 22:29:13 2009
Reason: Start method failed repeatedly, last exited with status 4.
See: http://sun.com/msg/SMF-8000-KS
See: man -M /usr/share/man -s 1M itadm
See: /var/svc/log/network-iscsi-target:default.log
Impact: This service is not running.
# tail /var/svc/log/network-iscsi-target\:default.log
[ Jul 14 22:29:13 Enabled. ]
[ Jul 14 22:29:13 Executing start method ("/lib/svc/method/iscsi-target start"). ]
iscsi-target: Requesting to enable iscsi target
open failed: INVALIDUnable to open device /devices/pseudo/iscsit@0:iscsit[ Jul 14 22:29:13 Method "start" exited with status 4. ]
Rebooting solved this problem.
# svcs -a | grep iscsi
disabled 19:25:15 svc:/network/iscsi_initiator:default
disabled 19:25:18 svc:/system/iscsitgt:default
online 19:25:37 svc:/network/iscsi/target:default
So now we create the iscsi target and check it looks ok:
# itadm create-target
Target iqn.1986-03.com.sun:02:7e969860-cb8d-475a-ddbb-97bf2f22ce7b successfully created
# itadm list-target -v
TARGET NAME STATE SESSIONS
iqn.1986-03.com.sun:02:7e969860-cb8d-475a-ddbb-97bf2f22ce7b online 0
alias: -
auth: none (defaults)
targetchapuser: -
targetchapsecret: unset
tpg-tags: default
I was then able to discover and logon to this iscsi target from a Windows 2003 server, using the Microsoft iscsi initiator.
The above is just the simplest possible configuration, just to allow some simple tests. For a more practical setup, you may well want multiple logical units, and want to put some restrictions on the views. But details on how to do that will have to wait for a follow-up post to this blog!
I did notice a couple of things fairly quickly. I saw the connection between the initiator and target was dropping every 50 seconds, followed immediately by a re-logon. Checking with WireShark showed that it was the target starting this with a [FIN, ACK]. With the old iscsi target I would have used DTrace to confirm this, but the iscsi target provider for the new iscsi target was not available.
I looked at the source code for the comstar iscsi target:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/comstar/port/iscsit/iscsit.c
.. and from the history link, I could see that a couple of likely fixes had already been committed:
26-May-2009 Priya Krishnan - 6809997 COMSTAR iscsi target DTrace Provider needed
08-May-2009 Peter Dunlap - 6755803 win2003 initiator numerous iscsi connection lost and connection retries mesgs to iscsi target
Now normally, I would hyper-link those bug number for you. But in this case there is no point, because Sun has chosen not to allows details of those bugs to be publicly available. (Or any of the other comstar iscsi target bugs AFAIK.)
Having those fixes is important for me, so it looks like I will have to update my OpenSolaris 2009-06 using the development repository to snv_118.
Thursday, June 25, 2009
Using SNMP with a Cisco router to troubleshoot & diagnose ADSL problems
We've been having numerous errors and problem with an ADSL line at the company where I work. We're trying to run an Ipsec tunnel across the ADSL line to connect to one of our small branch offices. Every few hours we would have problems, with the tunnel dropping out. Restarting the router would often fix the problem, but the problems never went away for long.
We were using good quality Cisco 1801 routers. The most useful command for debugging is 'show dsl interface atm0' and 'show int atm0'. Below is some typical output, while the ADSL line was faulty:
You can see from the above that the 'Down Stream' (DS) noise margin is very low at 1dB. We have also seen 0.5dB reported, or even 0db! You can also see the error counts are high. In the output of 'show int atm 0', the figure to monitor is the number of CRC errors - in the output above it's 182. It is also useful to set 'debug atm errors' on the cisco. The in the syslog's you will see messages like:
To try and dig deeper, I used SNMP (from a Linux box on the same LAN), to query the router:
If you have the ADSL-LINE-MIB in place, these OID's can be translated:
For a guide on how to use Linux SNMP, see my snmp tutorial.
Many of the figure in the output of 'show dsl interface atm0' are available via SNMP. For example 'Down Stream Noise Margin' is 'adslAturCurrSnrMgn' quoted in tenths of a decibel. So a reading of 15 means 1.5dB. To get an idea of how these values were changing with time, I used Cacti and RRD to poll the SNMP and produce some nice graphs. I Googles and found a Cacti 'ADSL Line Data' templates that allows you to graph Attenuation, Noise margin, and Output Power. Of those, only Noise Margin is interesting, because I found the Attenuation and Output Power to remain constant. But the thing the template did not graph was the number of uncorrectable errors, labelled 'adslAtucChanUncorrectBlks' in the MIB, and this is what I was really interested in, as when we were having problems with the ADSL line, the count of uncorrectable errors was high.
So I modified '/var/www/cacti/resource/snmp_queries/adslLine.xml' to include another OID. I chose 'ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks' which the MIB decribes as follows:
I modified the XML for the data query as follows:
When I started to graph this value, I was surprised by the range of values I saw. Often it was low, a few hundred, but some time it would shoot up to 60,000! I realised that to graph this range of values it would be better to use a logarithmic scale.
After graphing uncorrectable errors for a while I began to question what does 'within the current 15 minute interval.' actually mean. So let's look in detail at the OID's related to uncorrectable errors. Here are some (high) starting figures:
You can see that "adslAtucChanCorrectedBlks" and "adslAtucChanPerfCurr1DayUncorrectBlks" are identical values, and they were increasing exactly in step, so they are both SNMP 'Counters'. A few minutes latter, we get these reading:
In the period between taking these readings, the '15Min' figures, have reset to zero, and are counting up again. If you keep a constant watch on these figures, you can see that every 15 minutes they reset back to zero. I found the following command to be useful to monitor errors occuring in real time.
So with the benefit of hindsight, I may have been better using Cacti to graph 'adslAtucChanUncorrectBlks' as a 'COUNTER', rather than 'adslAtucChanPerfCurr15MinUncorrectBlks' as a 'GAUGE'.
Eventually we got BT to come on site to check the ADSL line, and they located
a bad connection. Since they fixed that, the noise margin has remained consistently above 5dB and the uncorrected blocks has been very low.
Here are some graph, where you can that the ADSL line has been fixed:
We were using good quality Cisco 1801 routers. The most useful command for debugging is 'show dsl interface atm0' and 'show int atm0'. Below is some typical output, while the ADSL line was faulty:
#show int atm 0
ATM0 is up, line protocol is up
Hardware is MPC ATMSAR (with Alcatel ADSL Module)
MTU 4470 bytes, sub MTU 4470, BW 832 Kbit/sec, DLY 440 usec,
reliability 255/255, txload 41/255, rxload 6/255
Encapsulation ATM, loopback not set
Encapsulation(s): AAL5 AAL2, PVC mode
23 maximum active VCs, 1024 VCs per VP, 1 current VCCs
VC Auto Creation Disabled.
VC idle disconnect time: 300 seconds
Last input never, output 00:00:00, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 48
Queueing strategy: Per VC Queueing
5 minute input rate 22000 bits/sec, 20 packets/sec
5 minute output rate 137000 bits/sec, 24 packets/sec
322340 packets input, 50047091 bytes, 0 no buffer
Received 0 broadcasts, 0 runts, 0 giants, 0 throttles
0 input errors, 182 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
345336 packets output, 145468959 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 output buffer failures, 0 output buffers swapped out
#show dsl interface atm0
ATM0
Alcatel 20190 chipset information
ATU-R (DS) ATU-C (US)
Modem Status: Showtime (DMTDSL_SHOWTIME)
DSL Mode: ITU G.992.1 (G.DMT) Annex A
ITU STD NUM: 0x03 0x2
Vendor ID: 'STMI' 'P '
Vendor Specific: 0x0000 0x0000
Vendor Country: 0x0F 0xB5
Chip ID: C196 (0)
DFE BOM: DFE3.0 Annex A (1)
Capacity Used: 100% 85%
Noise Margin: 1.0 dB 11.0 dB
Output Power: 20.0 dBm 12.5 dBm
Attenuation: 34.0 dB 17.0 dB
Defect Status: None None
Last Fail Code: None
Watchdog Counter: 0xA7
Watchdog Resets: 0
Selftest Result: 0x00
Subfunction: 0x00
Interrupts: 4155 (0 spurious)
PHY Access Err: 0
Activations: 3
LED Status: ON
LED On Time: 100
LED Off Time: 100
Init FW: init_AMR-3.0.014_no_bist.bin
Operation FW: AMR-3.0.014.bin
FW Source: embedded
FW Version: 3.0.14
Interleave Fast Interleave Fast
Speed (kbps): 4640 0 832 0
Cells: 716617 0 108647003 0
Reed-Solomon EC: 43244 0 9 9
CRC Errors: 25362 0 22 14
Header Errors: 15617 0 9 7
Total BER: 1610E-8 0E-0
Leakage Average BER: 2734E-8 0E-0
ATU-R (DS) ATU-C (US)
Bitswap: enabled enabled
Bitswap success: 0 0
Bitswap failure: 0 0
You can see from the above that the 'Down Stream' (DS) noise margin is very low at 1dB. We have also seen 0.5dB reported, or even 0db! You can also see the error counts are high. In the output of 'show int atm 0', the figure to monitor is the number of CRC errors - in the output above it's 182. It is also useful to set 'debug atm errors' on the cisco. The in the syslog's you will see messages like:
*Jun 19 10:22:01.894: ATM0: AAL5 rx errors (status = 0C030000)
*Jun 19 10:22:01.974: ATM0: AAL5 rx errors (status = 2C030000)
*Jun 19 10:22:02.758: ATM0: AAL5 rx errors (status = 0C030000)
*Jun 19 10:22:04.058: ATM0: AAL5 rx errors (status = 2C070000)
*Jun 19 10:22:04.922: ATM0: AAL5 rx errors (status = 0C030000)
*Jun 19 10:22:06.174: ATM0: AAL5 rx errors (status = 0C030000)
*Jun 19 10:22:07.822: ATM0: AAL5 rx errors (status = 0C030000)
To try and dig deeper, I used SNMP (from a Linux box on the same LAN), to query the router:
[root@hexdns1 snmp_queries]# snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94
SNMPv2-SMI::transmission.94.1.1.1.1.1.19 = INTEGER: 2
SNMPv2-SMI::transmission.94.1.1.1.1.2.19 = INTEGER: 3
SNMPv2-SMI::transmission.94.1.1.2.1.2.19 = STRING: "P"
SNMPv2-SMI::transmission.94.1.1.2.1.3.19 = STRING: "0"
SNMPv2-SMI::transmission.94.1.1.2.1.4.19 = Gauge32: 110
SNMPv2-SMI::transmission.94.1.1.2.1.5.19 = Gauge32: 175
SNMPv2-SMI::transmission.94.1.1.2.1.7.19 = Gauge32: 200
SNMPv2-SMI::transmission.94.1.1.2.1.8.19 = Gauge32: 4492000
SNMPv2-SMI::transmission.94.1.1.3.1.1.19 = ""
SNMPv2-SMI::transmission.94.1.1.3.1.2.19 = STRING: "STMI"
SNMPv2-SMI::transmission.94.1.1.3.1.3.19 = STRING: "0"
SNMPv2-SMI::transmission.94.1.1.3.1.4.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.3.1.5.19 = Gauge32: 340
SNMPv2-SMI::transmission.94.1.1.3.1.7.19 = Gauge32: 130
SNMPv2-SMI::transmission.94.1.1.3.1.8.19 = Gauge32: 967000
SNMPv2-SMI::transmission.94.1.1.4.1.2.19 = Gauge32: 4448000
SNMPv2-SMI::transmission.94.1.1.5.1.2.19 = Gauge32: 832000
SNMPv2-SMI::transmission.94.1.1.6.1.1.19 = Counter32: 697
SNMPv2-SMI::transmission.94.1.1.6.1.2.19 = Counter32: 449
SNMPv2-SMI::transmission.94.1.1.6.1.4.19 = Counter32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.5.19 = Counter32: 1820
SNMPv2-SMI::transmission.94.1.1.6.1.6.19 = Counter32: 31
SNMPv2-SMI::transmission.94.1.1.6.1.10.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.11.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.13.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.14.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.15.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.17.19 = Gauge32: 565
SNMPv2-SMI::transmission.94.1.1.6.1.18.19 = Gauge32: 381
SNMPv2-SMI::transmission.94.1.1.6.1.20.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.21.19 = Gauge32: 1034
SNMPv2-SMI::transmission.94.1.1.6.1.22.19 = Gauge32: 10
SNMPv2-SMI::transmission.94.1.1.7.1.1.19 = Counter32: 45
SNMPv2-SMI::transmission.94.1.1.7.1.2.19 = Counter32: 28
SNMPv2-SMI::transmission.94.1.1.7.1.3.19 = Counter32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.4.19 = Counter32: 15043
SNMPv2-SMI::transmission.94.1.1.7.1.8.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.9.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.10.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.11.19 = Gauge32: 328
SNMPv2-SMI::transmission.94.1.1.7.1.13.19 = Gauge32: 44
SNMPv2-SMI::transmission.94.1.1.7.1.14.19 = Gauge32: 27
SNMPv2-SMI::transmission.94.1.1.7.1.15.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.16.19 = Gauge32: 3717
SNMPv2-SMI::transmission.94.1.1.10.1.3.19 = Counter32: 1864592813
SNMPv2-SMI::transmission.94.1.1.10.1.4.19 = Counter32: 804238
SNMPv2-SMI::transmission.94.1.1.10.1.10.19 = Gauge32: 2538597
SNMPv2-SMI::transmission.94.1.1.10.1.11.19 = Gauge32: 11493
SNMPv2-SMI::transmission.94.1.1.10.1.15.19 = Gauge32: 804265846
SNMPv2-SMI::transmission.94.1.1.10.1.16.19 = Gauge32: 400641
SNMPv2-SMI::transmission.94.1.1.11.1.3.19 = Counter32: 31565
SNMPv2-SMI::transmission.94.1.1.11.1.4.19 = Counter32: 56466
SNMPv2-SMI::transmission.94.1.1.11.1.10.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.11.1.11.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.11.1.15.19 = Gauge32: 18799
SNMPv2-SMI::transmission.94.1.1.11.1.16.19 = Gauge32: 44213
If you have the ADSL-LINE-MIB in place, these OID's can be translated:
# snmptranslate 1.3.6.1.2.1.10.94
SNMPv2-SMI::transmission.94
# export MIBS=ALL
# snmptranslate 1.3.6.1.2.1.10.94
ADSL-LINE-MIB::adslMIB
For a guide on how to use Linux SNMP, see my snmp tutorial.
# snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94
ADSL-LINE-MIB::adslLineCoding.19 = INTEGER: 2
ADSL-LINE-MIB::adslLineType.19 = INTEGER: interleavedOnly(3)
ADSL-LINE-MIB::adslAtucInvVendorID.19 = STRING: P
ADSL-LINE-MIB::adslAtucInvVersionNumber.19 = STRING: 0
ADSL-LINE-MIB::adslAtucCurrSnrMgn.19 = Wrong Type (should be INTEGER): Gauge32: 110
ADSL-LINE-MIB::adslAtucCurrAtn.19 = Gauge32: 170 tenth dB
ADSL-LINE-MIB::adslAtucCurrOutputPwr.19 = Wrong Type (should be INTEGER): Gauge32: 200
ADSL-LINE-MIB::adslAtucCurrAttainableRate.19 = Gauge32: 4686000 bps
ADSL-LINE-MIB::adslAturInvSerialNumber.19 = STRING:
ADSL-LINE-MIB::adslAturInvVendorID.19 = STRING: STMI
ADSL-LINE-MIB::adslAturInvVersionNumber.19 = STRING: 0
ADSL-LINE-MIB::adslAturCurrSnrMgn.19 = Wrong Type (should be INTEGER): Gauge32: 15
ADSL-LINE-MIB::adslAturCurrAtn.19 = Gauge32: 340 tenth dB
ADSL-LINE-MIB::adslAturCurrOutputPwr.19 = Wrong Type (should be INTEGER): Gauge32: 130
ADSL-LINE-MIB::adslAturCurrAttainableRate.19 = Gauge32: 978000 bps
ADSL-LINE-MIB::adslAtucChanCurrTxRate.19 = Gauge32: 4640000 bps
ADSL-LINE-MIB::adslAturChanCurrTxRate.19 = Gauge32: 832000 bps
ADSL-LINE-MIB::adslAtucPerfLofs.19 = Counter32: 453
ADSL-LINE-MIB::adslAtucPerfLoss.19 = Counter32: 13
ADSL-LINE-MIB::adslAtucPerfLprs.19 = Counter32: 0
ADSL-LINE-MIB::adslAtucPerfESs.19 = Counter32: 1997
ADSL-LINE-MIB::adslAtucPerfInits.19 = Counter32: 18
ADSL-LINE-MIB::adslAtucPerfCurr15MinLofs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr15MinLoss.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr15MinLprs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr15MinESs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr15MinInits.19 = Gauge32: 0
ADSL-LINE-MIB::adslAtucPerfCurr1DayLofs.19 = Gauge32: 445 seconds
ADSL-LINE-MIB::adslAtucPerfCurr1DayLoss.19 = Gauge32: 14 seconds
ADSL-LINE-MIB::adslAtucPerfCurr1DayLprs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr1DayESs.19 = Gauge32: 1425 seconds
ADSL-LINE-MIB::adslAtucPerfCurr1DayInits.19 = Gauge32: 9
ADSL-LINE-MIB::adslAturPerfLofs.19 = Counter32: 22 seconds
ADSL-LINE-MIB::adslAturPerfLoss.19 = Counter32: 15 seconds
ADSL-LINE-MIB::adslAturPerfLprs.19 = Counter32: 0 seconds
ADSL-LINE-MIB::adslAturPerfESs.19 = Counter32: 95908 seconds
ADSL-LINE-MIB::adslAturPerfCurr15MinLofs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAturPerfCurr15MinLoss.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAturPerfCurr15MinLprs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAturPerfCurr15MinESs.19 = Gauge32: 345 seconds
ADSL-LINE-MIB::adslAturPerfCurr1DayLofs.19 = Gauge32: 22 seconds
ADSL-LINE-MIB::adslAturPerfCurr1DayLoss.19 = Gauge32: 15 seconds
ADSL-LINE-MIB::adslAturPerfCurr1DayLprs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAturPerfCurr1DayESs.19 = Gauge32: 21932 seconds
ADSL-LINE-MIB::adslAtucChanCorrectedBlks.19 = Counter32: 4045523380
ADSL-LINE-MIB::adslAtucChanUncorrectBlks.19 = Counter32: 23129426
ADSL-LINE-MIB::adslAtucChanPerfCurr15MinCorrectedBlks.19 = Gauge32: 6278668
ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks.19 = Gauge32: 17489
ADSL-LINE-MIB::adslAtucChanPerfCurr1DayCorrectedBlks.19 = Gauge32: 1783723602
ADSL-LINE-MIB::adslAtucChanPerfCurr1DayUncorrectBlks.19 = Gauge32: 10061120
ADSL-LINE-MIB::adslAturChanCorrectedBlks.19 = Counter32: 4235
ADSL-LINE-MIB::adslAturChanUncorrectBlks.19 = Counter32: 9371
ADSL-LINE-MIB::adslAturChanPerfCurr15MinCorrectedBlks.19 = Gauge32: 0
ADSL-LINE-MIB::adslAturChanPerfCurr15MinUncorrectBlks.19 = Gauge32: 0
ADSL-LINE-MIB::adslAturChanPerfCurr1DayCorrectedBlks.19 = Gauge32: 1538
ADSL-LINE-MIB::adslAturChanPerfCurr1DayUncorrectBlks.19 = Gauge32: 5374
Many of the figure in the output of 'show dsl interface atm0' are available via SNMP. For example 'Down Stream Noise Margin' is 'adslAturCurrSnrMgn' quoted in tenths of a decibel. So a reading of 15 means 1.5dB. To get an idea of how these values were changing with time, I used Cacti and RRD to poll the SNMP and produce some nice graphs. I Googles and found a Cacti 'ADSL Line Data' templates that allows you to graph Attenuation, Noise margin, and Output Power. Of those, only Noise Margin is interesting, because I found the Attenuation and Output Power to remain constant. But the thing the template did not graph was the number of uncorrectable errors, labelled 'adslAtucChanUncorrectBlks' in the MIB, and this is what I was really interested in, as when we were having problems with the ADSL line, the count of uncorrectable errors was high.
So I modified '/var/www/cacti/resource/snmp_queries/adslLine.xml' to include another OID. I chose 'ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks' which the MIB decribes as follows:
# snmptranslate -Td ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks
ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks
adslAtucChanPerfCurr15MinUncorrectBlks OBJECT-TYPE
-- FROM ADSL-LINE-MIB
MAX-ACCESS read-only
STATUS current
DESCRIPTION "Count of all blocks received with uncorrectable
errors on this channel within the current 15 minute
interval."
I modified the XML for the data query as follows:
# snmptranslate -On ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks
.1.3.6.1.2.1.10.94.1.1.10.1.11
# diff -u adslLine.orig.xml adslLine.xml
--- adslLine.orig.xml 2009-05-01 16:57:18.000000000 +0100
+++ adslLine.xml 2009-06-15 18:53:58.000000000 +0100
@@ -27,6 +27,13 @@
<direction>input</direction>
<oid>.1.3.6.1.2.1.10.94.1.1.3.1.2</oid>
</adslAtucInvVendorID>
+ <adslAtuc15mUncrBlks>
+ <name>Local Curr 15min Uncorrectable Blocks</name>
+ <method>walk</method>
+ <source>value</source>
+ <direction>output</direction>
+ <oid>.1.3.6.1.2.1.10.94.1.1.10.1.11</oid>
+ </adslAtuc15mUncrBlks>
<adslAtucCurrSnrMgn>
<name>Local Noise Margin (tenth dB)</name>
<method>walk</method>
When I started to graph this value, I was surprised by the range of values I saw. Often it was low, a few hundred, but some time it would shoot up to 60,000! I realised that to graph this range of values it would be better to use a logarithmic scale.
After graphing uncorrectable errors for a while I began to question what does 'within the current 15 minute interval.' actually mean. So let's look in detail at the OID's related to uncorrectable errors. Here are some (high) starting figures:
# snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94.1.1.10.1
SNMPv2-SMI::transmission.94.1.1.10.1.3.19 = Counter32: 293561906
SNMPv2-SMI::transmission.94.1.1.10.1.4.19 = Counter32: 1493720
SNMPv2-SMI::transmission.94.1.1.10.1.10.19 = Gauge32: 6671894
SNMPv2-SMI::transmission.94.1.1.10.1.11.19 = Gauge32: 80028
SNMPv2-SMI::transmission.94.1.1.10.1.15.19 = Gauge32: 293561906
SNMPv2-SMI::transmission.94.1.1.10.1.16.19 = Gauge32: 1493720
You can see that "adslAtucChanCorrectedBlks" and "adslAtucChanPerfCurr1DayUncorrectBlks" are identical values, and they were increasing exactly in step, so they are both SNMP 'Counters'. A few minutes latter, we get these reading:
# snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94.1.1.10.1
SNMPv2-SMI::transmission.94.1.1.10.1.3.19 = Counter32: 293628501
SNMPv2-SMI::transmission.94.1.1.10.1.4.19 = Counter32: 1495372
SNMPv2-SMI::transmission.94.1.1.10.1.10.19 = Gauge32: 27251
SNMPv2-SMI::transmission.94.1.1.10.1.11.19 = Gauge32: 1383
SNMPv2-SMI::transmission.94.1.1.10.1.15.19 = Gauge32: 293628501
SNMPv2-SMI::transmission.94.1.1.10.1.16.19 = Gauge32: 1495372
In the period between taking these readings, the '15Min' figures, have reset to zero, and are counting up again. If you keep a constant watch on these figures, you can see that every 15 minutes they reset back to zero. I found the following command to be useful to monitor errors occuring in real time.
# watch -n 5 'snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94.1.1.10.1'
So with the benefit of hindsight, I may have been better using Cacti to graph 'adslAtucChanUncorrectBlks' as a 'COUNTER', rather than 'adslAtucChanPerfCurr15MinUncorrectBlks' as a 'GAUGE'.
Eventually we got BT to come on site to check the ADSL line, and they located
a bad connection. Since they fixed that, the noise margin has remained consistently above 5dB and the uncorrected blocks has been very low.
Here are some graph, where you can that the ADSL line has been fixed:
Subscribe to:
Posts (Atom)