Wednesday, July 29, 2009

Microsoft Office Communicator 2007 R2 and DSCP QoS

Anyone trying to get DSCP working with Office Communicator 2007 R2 on Windows XP needs to grab the 'July' update (v3.5.6907.37) otherwise you will be having problems getting video to work. Here are the links:

"Description of the update for Communicator 2007 R2: July 2009"
http://support.microsoft.com/kb/969695/

"Video frames are not displayed in Office Communicator 2007 R2 on a Windows XP-based computer"
http://support.microsoft.com/kb/971846/

(BTW, I'd like to thank Michael Melling for spotting those links, and for helping with the testing that I describe below.)

Before the update, with Office Communicator versions v3.5.6907.0 or v3.5.6907.34, on Windows XP, after setting QoSEnabled to 1, you find audio continues to work fine, but video fails. Most of the time we just got a black video windows, sometimes it was a grey window, and sometimes a frozen image.

Using OCS's "Monitoring Server Reports" you can see whats going wrong. Click the "User Call List" link, and then filter by one of the users in the call, and this will give you a detailed statistical report of all finished calls. For the Video Stream you will see a "Packet loss Rate" of about 60% !

Office Communicator, with QoSEnabled, works fine on Vista and Windows 7, without this 'July' update. I think that Microsoft must have done the majority of their testing with their latest operating systems, and have only recently got around to testing with Windows XP.

We have been battling this problem for a couple of weeks, so this fix comes just in time! In trying to investigate what was goingwrong, we have learned a great deal, so not all the time was wasted, just some!

DSCP QoS is enabled in the registry, like this:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\RTC\Transport
"QoSEnabled"=dword:00000001

.. as per these instructions:

"Enabling DSCP Marking"
http://technet.microsoft.com/en-us/library/dd441192(office.13).aspx

BTW, you need to exit (not just close) Office Communicator and restart it, for this registry setting to take effect.

Once Office Communicator has setup the call using SIP (over TCP), using Office Communications Server 2007 R2, all audio & video network traffic is sent directly between the two clients using RTP (Real Time Protocol) packets in UDP packets.

Using WireShark, we could see the packets being tagged correctly:

Audio - DSCP 0x28: Class Selector CS5
Video - DSCP 0x18: Class Selector CS3

BTW, these are the default values. You can change them using a group policy.

In WireShark, to just grab the audio & video traffic, use a capture filter like this:

udp and ip[1]!=0

This will match any UDP packets that have a non-zero DSCP field in the IP header.You can use the following display filters to select either Audio or Video packets:

For Audio, use:
ip.dsfield.dscp == 40

For Video, use:
ip.dsfield.dscp == 24

BTW, if you use tcpdump to capture the packets, remember that the DSCP values are multiplied by four. So use these capture filters with tcpdump:

For Audio:
ip[1]=0xA0
ip[1]=160

For Video:
ip[1]=0x60
ip[1]=96

It is very useful to be able to see the RTP headers in the packets. To do this in WireShark, right-click on any of the packets of a UDP stream, and select 'Decode As..', the select 'Transport' as RTP.

Once you have WireShark decoding the RTP header, you can see a couple of interesting fields in the header. The first is 'Payload type',and this is set as follows:

Type 97 - RT-Audio encoded redundant data
Type 111 - 'Siren' audio - used by 'LiveMeeting'
Type 114 - RTaudio (x-msrta)
Type 118 - 'Comfort Noise'
Type 121 - RTvideo (x-rtvc1)

The second interesting RTP header field is the 'Sequence number'. This is a 16-bit counter field, that increments by 1 for each packet sent in a particular stream, and is useful for detecting dropped or duplicated packets. (The sequence number seems to start at a random value.)

There is a useful feature in WireShark which will do 'Stream Anaysis..'. On WireShark v1.2.0 you will find this under the 'Telephony > RTP' menu. This will highlight where the sequence number goes wrong.

Ok, so why was the video failing, (before the 'July' update for Office Communication). We had two laptops, running Office Communicator, connected together via a HP Procurve switch,and enabled port mirroring on the switch, to a third laptop to see all network traffic. Running WireShark on the client sending video, we saw a lot more video packet, then were being received at the other end, on the receiving client. The third monitoring laptop confirmed that packets were being dropped before they emerged 'onto the wire' from the sending laptop. Looking closely at the WireShark capture from the laptop sending video, and using the 'Stream Analysis..' we saw many video packets with duplicate RTP sequence numbers! Correlating with the capture on the receiving client, we saw that any packet that had been duplicated was missing. It was not a case of just the duplicated packet being missing, but also the original packet. Thus from the point of view of the receiving packet, there was a massive level of packet loss, hence it could not display the video stream.

BTW, on Windows XP, you can use 'tcmon.exe' from the resource kit to see what QoS policies are in effect, and this confirms packet loss onthe video stream. Here are some screen shots:




Unfortunately tcmon does not run on Vista or Windows 7, which is very disapointing. Let's hope that Microsoft get around to fixing this!

BTW, if you are using Linux to priorities the audio & video traffic as it is sent into the WAN, like we are, you will find the following iptables & 'tc filters' useful to match the packets:

# Audio:
iptables -t mangle -A POSTROUTING -o eth0 -p udp -m dscp --dscp 40 -j mark1
iptables -t mangle -A POSTROUTING -o eth0 -p udp -m dscp --dscp-class CS5 -j mark1
tc filter add dev eth0 parent 10:0 prio 1 protocol ip u32 match ip tos 0xA0 0xff flowid 10:200

# Video:
iptables -t mangle -A POSTROUTING -o eth0 -p udp -m dscp --dscp 24 -j mark2
iptables -t mangle -A POSTROUTING -o eth0 -p udp -m dscp --dscp-class CS3 -j mark2
tc filter add dev eth0 parent 10:0 prio 1 protocol ip u32 match ip tos 0x60 0xff flowid 10:210

Best Regards
Nigel Smith

OpenSolaris 2009-06 and the Comstar iScsi Target

Having upgraded from OpenSolaris 2008-11 to 2009-06, I wanted to try the new Comstar iscsi target, to see how well it worked, and to see if it was ready to replace the 'old' iscsi target. And I wanted to see how far I could get with the package versions in the 'production-quality' release repository, based on snv_111. To start with, I just wanted to get something basic going as quickly as possible.

The first tasks was to disable the old iscsi target (to avoid a conflict with the new target which would also want to listen on TCP port 3260), and to enable the 'SCSI Target Mode Framework' - stmf.

# svcs iscsitgt
online Jul_09 svc:/system/iscsitgt:default
# svcs stmf
disabled Jul_09 svc:/system/stmf:default
# svcadm disable iscsitgt
# svcs iscsitgt
STATE STIME FMRI
disabled 21:29:57 svc:/system/iscsitgt:default
# svcadm enable stmf
# svcs stmf
STATE STIME FMRI
online 21:30:55 svc:/system/stmf:default
# stmfadm list-state
Operational Status: online
Config Status : initialized

Next I created a new small zfs volume.
(My zfs pool is named 'rz2pool' - it's Raidz2 - dual parity.)

# zfs create -V 4g rz2pool/iscsi_lun1

And then created a 'logical unit', using the zvol as it's backing store.

# sbdadm create-lu /dev/zvol/rdsk/rz2pool/iscsi_lun1

Created the following LU:

GUID DATA SIZE SOURCE
-------------------------------- ------------------- ----------------
600144f00008278f04694a5cf0980001 4294901760 /dev/zvol/rdsk/rz2pool/iscsi_lun1

# stmfadm list-lu -v
LU Name: 600144F00008278F04694A5CF0980001
Operational Status: Online
Provider Name : sbd
Alias : /dev/zvol/rdsk/rz2pool/iscsi_lun1
View Entry Count : 0

Now using the GUID of the logical unit, we make it visible to any initiators, by adding a 'view'

# stmfadm add-view 600144f00008278f04694a5cf0980001
# stmfadm list-view -l 600144f00008278f04694a5cf0980001
View Entry: 0
Host group : All
Target group : All
LUN : 0

Next we need to do the iscsi part, for which we need to install the iscsi target and the 'itadm' command. These were not already installed so I had to add the 'SUNWiscsit' package.

# pkg install -v SUNWiscsit
Creating Plan | Before evaluation:
UNEVALUATED:
+pkg:/SUNWiscsit@0.5.11,5.11-0.111:20090508T161047Z

After evaluation:
None -> pkg:/SUNWiscsit@0.5.11,5.11-0.111:20090508T161047Z
None -> pkg:/SUNWiscsidm@0.5.11,5.11-0.111:20090508T161041Z
Actuators:
restart_fmri: svc:/system/manifest-import:default
None
DOWNLOAD PKGS FILES XFER (MB)
Completed 2/2 23/23 0.66/0.66

PHASE ACTIONS
Install Phase 76/76
PHASE ITEMS
Reading Existing Index 8/8
Indexing Packages 2/2

# pkg list -v SUNWiscsit
FMRI STATE UFIX
pkg:/SUNWiscsit@0.5.11,5.11-0.111:20090508T161047Z installed ----
# pkg list -v SUNWiscsidm
FMRI STATE UFIX
pkg:/SUNWiscsidm@0.5.11,5.11-0.111:20090508T161041Z installed ----

On the next step I had a small problem.
(BTW, I am not using NWAM to configure my network card.)

# svcadm enable -r iscsi/target:default
svcadm: svc:/milestone/network depends on svc:/network/physical, which has multiple instances.

# svcs -a | grep network/physical
disabled Jul_09 svc:/network/physical:nwam
online Jul_09 svc:/network/physical:default

# svcadm enable iscsi/target
# svcs -a | grep iscsi
disabled Jul_09 svc:/network/iscsi_initiator:default
disabled 21:29:57 svc:/system/iscsitgt:default
maintenance 22:29:13 svc:/network/iscsi/target:default

# svcs -xv
svc:/network/iscsi/target:default (iscsi target)
State: maintenance since Tue Jul 14 22:29:13 2009
Reason: Start method failed repeatedly, last exited with status 4.
See: http://sun.com/msg/SMF-8000-KS
See: man -M /usr/share/man -s 1M itadm
See: /var/svc/log/network-iscsi-target:default.log
Impact: This service is not running.

# tail /var/svc/log/network-iscsi-target\:default.log
[ Jul 14 22:29:13 Enabled. ]
[ Jul 14 22:29:13 Executing start method ("/lib/svc/method/iscsi-target start"). ]
iscsi-target: Requesting to enable iscsi target
open failed: INVALIDUnable to open device /devices/pseudo/iscsit@0:iscsit[ Jul 14 22:29:13 Method "start" exited with status 4. ]

Rebooting solved this problem.

# svcs -a | grep iscsi
disabled 19:25:15 svc:/network/iscsi_initiator:default
disabled 19:25:18 svc:/system/iscsitgt:default
online 19:25:37 svc:/network/iscsi/target:default

So now we create the iscsi target and check it looks ok:

# itadm create-target
Target iqn.1986-03.com.sun:02:7e969860-cb8d-475a-ddbb-97bf2f22ce7b successfully created

# itadm list-target -v
TARGET NAME STATE SESSIONS
iqn.1986-03.com.sun:02:7e969860-cb8d-475a-ddbb-97bf2f22ce7b online 0
alias: -
auth: none (defaults)
targetchapuser: -
targetchapsecret: unset
tpg-tags: default

I was then able to discover and logon to this iscsi target from a Windows 2003 server, using the Microsoft iscsi initiator.

The above is just the simplest possible configuration, just to allow some simple tests. For a more practical setup, you may well want multiple logical units, and want to put some restrictions on the views. But details on how to do that will have to wait for a follow-up post to this blog!

I did notice a couple of things fairly quickly. I saw the connection between the initiator and target was dropping every 50 seconds, followed immediately by a re-logon. Checking with WireShark showed that it was the target starting this with a [FIN, ACK]. With the old iscsi target I would have used DTrace to confirm this, but the iscsi target provider for the new iscsi target was not available.

I looked at the source code for the comstar iscsi target:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/comstar/port/iscsit/iscsit.c

.. and from the history link, I could see that a couple of likely fixes had already been committed:

26-May-2009 Priya Krishnan - 6809997 COMSTAR iscsi target DTrace Provider needed
08-May-2009 Peter Dunlap - 6755803 win2003 initiator numerous iscsi connection lost and connection retries mesgs to iscsi target

Now normally, I would hyper-link those bug number for you. But in this case there is no point, because Sun has chosen not to allows details of those bugs to be publicly available. (Or any of the other comstar iscsi target bugs AFAIK.)

Having those fixes is important for me, so it looks like I will have to update my OpenSolaris 2009-06 using the development repository to snv_118.

Thursday, June 25, 2009

Using SNMP with a Cisco router to troubleshoot & diagnose ADSL problems

We've been having numerous errors and problem with an ADSL line at the company where I work. We're trying to run an Ipsec tunnel across the ADSL line to connect to one of our small branch offices. Every few hours we would have problems, with the tunnel dropping out. Restarting the router would often fix the problem, but the problems never went away for long.

We were using good quality Cisco 1801 routers. The most useful command for debugging is 'show dsl interface atm0' and 'show int atm0'. Below is some typical output, while the ADSL line was faulty:

#show int atm 0
ATM0 is up, line protocol is up
Hardware is MPC ATMSAR (with Alcatel ADSL Module)
MTU 4470 bytes, sub MTU 4470, BW 832 Kbit/sec, DLY 440 usec,
reliability 255/255, txload 41/255, rxload 6/255
Encapsulation ATM, loopback not set
Encapsulation(s): AAL5 AAL2, PVC mode
23 maximum active VCs, 1024 VCs per VP, 1 current VCCs
VC Auto Creation Disabled.
VC idle disconnect time: 300 seconds
Last input never, output 00:00:00, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 48
Queueing strategy: Per VC Queueing
5 minute input rate 22000 bits/sec, 20 packets/sec
5 minute output rate 137000 bits/sec, 24 packets/sec
322340 packets input, 50047091 bytes, 0 no buffer
Received 0 broadcasts, 0 runts, 0 giants, 0 throttles
0 input errors, 182 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
345336 packets output, 145468959 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 output buffer failures, 0 output buffers swapped out

#show dsl interface atm0
ATM0
Alcatel 20190 chipset information
ATU-R (DS) ATU-C (US)
Modem Status: Showtime (DMTDSL_SHOWTIME)
DSL Mode: ITU G.992.1 (G.DMT) Annex A
ITU STD NUM: 0x03 0x2
Vendor ID: 'STMI' 'P '
Vendor Specific: 0x0000 0x0000
Vendor Country: 0x0F 0xB5
Chip ID: C196 (0)
DFE BOM: DFE3.0 Annex A (1)
Capacity Used: 100% 85%
Noise Margin: 1.0 dB 11.0 dB
Output Power: 20.0 dBm 12.5 dBm
Attenuation: 34.0 dB 17.0 dB
Defect Status: None None
Last Fail Code: None
Watchdog Counter: 0xA7
Watchdog Resets: 0
Selftest Result: 0x00
Subfunction: 0x00
Interrupts: 4155 (0 spurious)
PHY Access Err: 0
Activations: 3
LED Status: ON
LED On Time: 100
LED Off Time: 100
Init FW: init_AMR-3.0.014_no_bist.bin
Operation FW: AMR-3.0.014.bin
FW Source: embedded
FW Version: 3.0.14

Interleave Fast Interleave Fast
Speed (kbps): 4640 0 832 0
Cells: 716617 0 108647003 0
Reed-Solomon EC: 43244 0 9 9
CRC Errors: 25362 0 22 14
Header Errors: 15617 0 9 7
Total BER: 1610E-8 0E-0
Leakage Average BER: 2734E-8 0E-0
ATU-R (DS) ATU-C (US)
Bitswap: enabled enabled
Bitswap success: 0 0
Bitswap failure: 0 0

You can see from the above that the 'Down Stream' (DS) noise margin is very low at 1dB. We have also seen 0.5dB reported, or even 0db! You can also see the error counts are high. In the output of 'show int atm 0', the figure to monitor is the number of CRC errors - in the output above it's 182. It is also useful to set 'debug atm errors' on the cisco. The in the syslog's you will see messages like:

*Jun 19 10:22:01.894: ATM0: AAL5 rx errors (status = 0C030000)
*Jun 19 10:22:01.974: ATM0: AAL5 rx errors (status = 2C030000)
*Jun 19 10:22:02.758: ATM0: AAL5 rx errors (status = 0C030000)
*Jun 19 10:22:04.058: ATM0: AAL5 rx errors (status = 2C070000)
*Jun 19 10:22:04.922: ATM0: AAL5 rx errors (status = 0C030000)
*Jun 19 10:22:06.174: ATM0: AAL5 rx errors (status = 0C030000)
*Jun 19 10:22:07.822: ATM0: AAL5 rx errors (status = 0C030000)

To try and dig deeper, I used SNMP (from a Linux box on the same LAN), to query the router:

[root@hexdns1 snmp_queries]# snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94
SNMPv2-SMI::transmission.94.1.1.1.1.1.19 = INTEGER: 2
SNMPv2-SMI::transmission.94.1.1.1.1.2.19 = INTEGER: 3
SNMPv2-SMI::transmission.94.1.1.2.1.2.19 = STRING: "P"
SNMPv2-SMI::transmission.94.1.1.2.1.3.19 = STRING: "0"
SNMPv2-SMI::transmission.94.1.1.2.1.4.19 = Gauge32: 110
SNMPv2-SMI::transmission.94.1.1.2.1.5.19 = Gauge32: 175
SNMPv2-SMI::transmission.94.1.1.2.1.7.19 = Gauge32: 200
SNMPv2-SMI::transmission.94.1.1.2.1.8.19 = Gauge32: 4492000
SNMPv2-SMI::transmission.94.1.1.3.1.1.19 = ""
SNMPv2-SMI::transmission.94.1.1.3.1.2.19 = STRING: "STMI"
SNMPv2-SMI::transmission.94.1.1.3.1.3.19 = STRING: "0"
SNMPv2-SMI::transmission.94.1.1.3.1.4.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.3.1.5.19 = Gauge32: 340
SNMPv2-SMI::transmission.94.1.1.3.1.7.19 = Gauge32: 130
SNMPv2-SMI::transmission.94.1.1.3.1.8.19 = Gauge32: 967000
SNMPv2-SMI::transmission.94.1.1.4.1.2.19 = Gauge32: 4448000
SNMPv2-SMI::transmission.94.1.1.5.1.2.19 = Gauge32: 832000
SNMPv2-SMI::transmission.94.1.1.6.1.1.19 = Counter32: 697
SNMPv2-SMI::transmission.94.1.1.6.1.2.19 = Counter32: 449
SNMPv2-SMI::transmission.94.1.1.6.1.4.19 = Counter32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.5.19 = Counter32: 1820
SNMPv2-SMI::transmission.94.1.1.6.1.6.19 = Counter32: 31
SNMPv2-SMI::transmission.94.1.1.6.1.10.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.11.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.13.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.14.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.15.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.17.19 = Gauge32: 565
SNMPv2-SMI::transmission.94.1.1.6.1.18.19 = Gauge32: 381
SNMPv2-SMI::transmission.94.1.1.6.1.20.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.6.1.21.19 = Gauge32: 1034
SNMPv2-SMI::transmission.94.1.1.6.1.22.19 = Gauge32: 10
SNMPv2-SMI::transmission.94.1.1.7.1.1.19 = Counter32: 45
SNMPv2-SMI::transmission.94.1.1.7.1.2.19 = Counter32: 28
SNMPv2-SMI::transmission.94.1.1.7.1.3.19 = Counter32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.4.19 = Counter32: 15043
SNMPv2-SMI::transmission.94.1.1.7.1.8.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.9.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.10.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.11.19 = Gauge32: 328
SNMPv2-SMI::transmission.94.1.1.7.1.13.19 = Gauge32: 44
SNMPv2-SMI::transmission.94.1.1.7.1.14.19 = Gauge32: 27
SNMPv2-SMI::transmission.94.1.1.7.1.15.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.7.1.16.19 = Gauge32: 3717
SNMPv2-SMI::transmission.94.1.1.10.1.3.19 = Counter32: 1864592813
SNMPv2-SMI::transmission.94.1.1.10.1.4.19 = Counter32: 804238
SNMPv2-SMI::transmission.94.1.1.10.1.10.19 = Gauge32: 2538597
SNMPv2-SMI::transmission.94.1.1.10.1.11.19 = Gauge32: 11493
SNMPv2-SMI::transmission.94.1.1.10.1.15.19 = Gauge32: 804265846
SNMPv2-SMI::transmission.94.1.1.10.1.16.19 = Gauge32: 400641
SNMPv2-SMI::transmission.94.1.1.11.1.3.19 = Counter32: 31565
SNMPv2-SMI::transmission.94.1.1.11.1.4.19 = Counter32: 56466
SNMPv2-SMI::transmission.94.1.1.11.1.10.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.11.1.11.19 = Gauge32: 0
SNMPv2-SMI::transmission.94.1.1.11.1.15.19 = Gauge32: 18799
SNMPv2-SMI::transmission.94.1.1.11.1.16.19 = Gauge32: 44213

If you have the ADSL-LINE-MIB in place, these OID's can be translated:

# snmptranslate 1.3.6.1.2.1.10.94
SNMPv2-SMI::transmission.94
# export MIBS=ALL
# snmptranslate 1.3.6.1.2.1.10.94
ADSL-LINE-MIB::adslMIB

For a guide on how to use Linux SNMP, see my snmp tutorial.

# snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94
ADSL-LINE-MIB::adslLineCoding.19 = INTEGER: 2
ADSL-LINE-MIB::adslLineType.19 = INTEGER: interleavedOnly(3)
ADSL-LINE-MIB::adslAtucInvVendorID.19 = STRING: P
ADSL-LINE-MIB::adslAtucInvVersionNumber.19 = STRING: 0
ADSL-LINE-MIB::adslAtucCurrSnrMgn.19 = Wrong Type (should be INTEGER): Gauge32: 110
ADSL-LINE-MIB::adslAtucCurrAtn.19 = Gauge32: 170 tenth dB
ADSL-LINE-MIB::adslAtucCurrOutputPwr.19 = Wrong Type (should be INTEGER): Gauge32: 200
ADSL-LINE-MIB::adslAtucCurrAttainableRate.19 = Gauge32: 4686000 bps
ADSL-LINE-MIB::adslAturInvSerialNumber.19 = STRING:
ADSL-LINE-MIB::adslAturInvVendorID.19 = STRING: STMI
ADSL-LINE-MIB::adslAturInvVersionNumber.19 = STRING: 0
ADSL-LINE-MIB::adslAturCurrSnrMgn.19 = Wrong Type (should be INTEGER): Gauge32: 15
ADSL-LINE-MIB::adslAturCurrAtn.19 = Gauge32: 340 tenth dB
ADSL-LINE-MIB::adslAturCurrOutputPwr.19 = Wrong Type (should be INTEGER): Gauge32: 130
ADSL-LINE-MIB::adslAturCurrAttainableRate.19 = Gauge32: 978000 bps
ADSL-LINE-MIB::adslAtucChanCurrTxRate.19 = Gauge32: 4640000 bps
ADSL-LINE-MIB::adslAturChanCurrTxRate.19 = Gauge32: 832000 bps
ADSL-LINE-MIB::adslAtucPerfLofs.19 = Counter32: 453
ADSL-LINE-MIB::adslAtucPerfLoss.19 = Counter32: 13
ADSL-LINE-MIB::adslAtucPerfLprs.19 = Counter32: 0
ADSL-LINE-MIB::adslAtucPerfESs.19 = Counter32: 1997
ADSL-LINE-MIB::adslAtucPerfInits.19 = Counter32: 18
ADSL-LINE-MIB::adslAtucPerfCurr15MinLofs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr15MinLoss.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr15MinLprs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr15MinESs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr15MinInits.19 = Gauge32: 0
ADSL-LINE-MIB::adslAtucPerfCurr1DayLofs.19 = Gauge32: 445 seconds
ADSL-LINE-MIB::adslAtucPerfCurr1DayLoss.19 = Gauge32: 14 seconds
ADSL-LINE-MIB::adslAtucPerfCurr1DayLprs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAtucPerfCurr1DayESs.19 = Gauge32: 1425 seconds
ADSL-LINE-MIB::adslAtucPerfCurr1DayInits.19 = Gauge32: 9
ADSL-LINE-MIB::adslAturPerfLofs.19 = Counter32: 22 seconds
ADSL-LINE-MIB::adslAturPerfLoss.19 = Counter32: 15 seconds
ADSL-LINE-MIB::adslAturPerfLprs.19 = Counter32: 0 seconds
ADSL-LINE-MIB::adslAturPerfESs.19 = Counter32: 95908 seconds
ADSL-LINE-MIB::adslAturPerfCurr15MinLofs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAturPerfCurr15MinLoss.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAturPerfCurr15MinLprs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAturPerfCurr15MinESs.19 = Gauge32: 345 seconds
ADSL-LINE-MIB::adslAturPerfCurr1DayLofs.19 = Gauge32: 22 seconds
ADSL-LINE-MIB::adslAturPerfCurr1DayLoss.19 = Gauge32: 15 seconds
ADSL-LINE-MIB::adslAturPerfCurr1DayLprs.19 = Gauge32: 0 seconds
ADSL-LINE-MIB::adslAturPerfCurr1DayESs.19 = Gauge32: 21932 seconds
ADSL-LINE-MIB::adslAtucChanCorrectedBlks.19 = Counter32: 4045523380
ADSL-LINE-MIB::adslAtucChanUncorrectBlks.19 = Counter32: 23129426
ADSL-LINE-MIB::adslAtucChanPerfCurr15MinCorrectedBlks.19 = Gauge32: 6278668
ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks.19 = Gauge32: 17489
ADSL-LINE-MIB::adslAtucChanPerfCurr1DayCorrectedBlks.19 = Gauge32: 1783723602
ADSL-LINE-MIB::adslAtucChanPerfCurr1DayUncorrectBlks.19 = Gauge32: 10061120
ADSL-LINE-MIB::adslAturChanCorrectedBlks.19 = Counter32: 4235
ADSL-LINE-MIB::adslAturChanUncorrectBlks.19 = Counter32: 9371
ADSL-LINE-MIB::adslAturChanPerfCurr15MinCorrectedBlks.19 = Gauge32: 0
ADSL-LINE-MIB::adslAturChanPerfCurr15MinUncorrectBlks.19 = Gauge32: 0
ADSL-LINE-MIB::adslAturChanPerfCurr1DayCorrectedBlks.19 = Gauge32: 1538
ADSL-LINE-MIB::adslAturChanPerfCurr1DayUncorrectBlks.19 = Gauge32: 5374

Many of the figure in the output of 'show dsl interface atm0' are available via SNMP. For example 'Down Stream Noise Margin' is 'adslAturCurrSnrMgn' quoted in tenths of a decibel. So a reading of 15 means 1.5dB. To get an idea of how these values were changing with time, I used Cacti and RRD to poll the SNMP and produce some nice graphs. I Googles and found a Cacti 'ADSL Line Data' templates that allows you to graph Attenuation, Noise margin, and Output Power. Of those, only Noise Margin is interesting, because I found the Attenuation and Output Power to remain constant. But the thing the template did not graph was the number of uncorrectable errors, labelled 'adslAtucChanUncorrectBlks' in the MIB, and this is what I was really interested in, as when we were having problems with the ADSL line, the count of uncorrectable errors was high.

So I modified '/var/www/cacti/resource/snmp_queries/adslLine.xml' to include another OID. I chose 'ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks' which the MIB decribes as follows:

# snmptranslate -Td ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks
ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks
adslAtucChanPerfCurr15MinUncorrectBlks OBJECT-TYPE
-- FROM ADSL-LINE-MIB
MAX-ACCESS read-only
STATUS current
DESCRIPTION "Count of all blocks received with uncorrectable
errors on this channel within the current 15 minute
interval."

I modified the XML for the data query as follows:

# snmptranslate -On ADSL-LINE-MIB::adslAtucChanPerfCurr15MinUncorrectBlks
.1.3.6.1.2.1.10.94.1.1.10.1.11

# diff -u adslLine.orig.xml adslLine.xml
--- adslLine.orig.xml 2009-05-01 16:57:18.000000000 +0100
+++ adslLine.xml 2009-06-15 18:53:58.000000000 +0100
@@ -27,6 +27,13 @@
<direction>input</direction>
<oid>.1.3.6.1.2.1.10.94.1.1.3.1.2</oid>
</adslAtucInvVendorID>
+ <adslAtuc15mUncrBlks>
+ <name>Local Curr 15min Uncorrectable Blocks</name>
+ <method>walk</method>
+ <source>value</source>
+ <direction>output</direction>
+ <oid>.1.3.6.1.2.1.10.94.1.1.10.1.11</oid>
+ </adslAtuc15mUncrBlks>
<adslAtucCurrSnrMgn>
<name>Local Noise Margin (tenth dB)</name>
<method>walk</method>

When I started to graph this value, I was surprised by the range of values I saw. Often it was low, a few hundred, but some time it would shoot up to 60,000! I realised that to graph this range of values it would be better to use a logarithmic scale.

After graphing uncorrectable errors for a while I began to question what does 'within the current 15 minute interval.' actually mean. So let's look in detail at the OID's related to uncorrectable errors. Here are some (high) starting figures:

# snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94.1.1.10.1
SNMPv2-SMI::transmission.94.1.1.10.1.3.19 = Counter32: 293561906
SNMPv2-SMI::transmission.94.1.1.10.1.4.19 = Counter32: 1493720
SNMPv2-SMI::transmission.94.1.1.10.1.10.19 = Gauge32: 6671894
SNMPv2-SMI::transmission.94.1.1.10.1.11.19 = Gauge32: 80028
SNMPv2-SMI::transmission.94.1.1.10.1.15.19 = Gauge32: 293561906
SNMPv2-SMI::transmission.94.1.1.10.1.16.19 = Gauge32: 1493720

You can see that "adslAtucChanCorrectedBlks" and "adslAtucChanPerfCurr1DayUncorrectBlks" are identical values, and they were increasing exactly in step, so they are both SNMP 'Counters'. A few minutes latter, we get these reading:

# snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94.1.1.10.1
SNMPv2-SMI::transmission.94.1.1.10.1.3.19 = Counter32: 293628501
SNMPv2-SMI::transmission.94.1.1.10.1.4.19 = Counter32: 1495372
SNMPv2-SMI::transmission.94.1.1.10.1.10.19 = Gauge32: 27251
SNMPv2-SMI::transmission.94.1.1.10.1.11.19 = Gauge32: 1383
SNMPv2-SMI::transmission.94.1.1.10.1.15.19 = Gauge32: 293628501
SNMPv2-SMI::transmission.94.1.1.10.1.16.19 = Gauge32: 1495372

In the period between taking these readings, the '15Min' figures, have reset to zero, and are counting up again. If you keep a constant watch on these figures, you can see that every 15 minutes they reset back to zero. I found the following command to be useful to monitor errors occuring in real time.

# watch -n 5 'snmpwalk -v1 -c public 192.168.6.1 1.3.6.1.2.1.10.94.1.1.10.1'

So with the benefit of hindsight, I may have been better using Cacti to graph 'adslAtucChanUncorrectBlks' as a 'COUNTER', rather than 'adslAtucChanPerfCurr15MinUncorrectBlks' as a 'GAUGE'.

Eventually we got BT to come on site to check the ADSL line, and they located
a bad connection. Since they fixed that, the noise margin has remained consistently above 5dB and the uncorrected blocks has been very low.

Here are some graph, where you can that the ADSL line has been fixed:

Tuesday, October 07, 2008

A new way to search & browse the OpenSolaris Mail Lists

You can now search and browse through over a quarter million messages posted to the OpenSolaris Mail-lists using a free service called 'MarkMail'.

Head over to http://opensolaris.markmail.org/ and take a look.

I stumbled across the MarkMail web site last week,via Google. It seems to have been running for about 12 months, and in that time they have loaded the mail archives of thousands of open-source projects into their system.

But until last week, they did not include anything on OpenSolaris. A shocking omission! So I left feedback, and I was quickly contacted by MarkMail's Jason Hunter. Jason wasted no time in subscribing their system to most of the OpenSolaris lists, and then within a few days they had loaded the 'MailMan' archive files. (If you would like to see further lists added then leave feedback on the MarkMail site via their form.)

By going to this link http://opensolaris.markmail.org/search/?q= you can see the latest emails that have been added into the system. I find this a useful way of just getting some idea of the vast amount of activity being generated around the OpenSolaris communities and projects. If your browser has 'flash' enabled, you will also see a graph of email activity per month, which shows the story of tremendous growth in interest around OpenSolaris. You will also see some fascinating statistics which tell you which are the most popular mail lists and which individuals have made the most posts.

Tuesday, September 16, 2008

Progress on FISHworks?

Very little has been publicly announced by Sun about the FISHworks project.
After some initial news in early 2007, it seemed to go quiet.

Based on those reports, I would speculate that FISHworks appears to be a project to create a NAS/SAN appliance, based on existing OpenSolaris source code for ZFS, CIFS server, NFS server and the iScsi target, using DTrace observability to instrument under the hood.

At the start of September 2008, some more details about FISHworks emerged
when one of Sun's top engineers Mike Shapiro gave a video interview, bizarrely, on the noisy streets of San Francisco! It's a pity that this interview was conducted outside of a bar, as the background noise is somewhat distracting!

So when can we expect a proper announcement from Sun on FISHworks. Well according to the New York Time, that will be in November.

So do you want to know what progress is being made?

Then I suggest you check out the OpenSolaris bugs database.
This link shows bug activity in the iScsi target code for the last 30 days.

In the 'Reported Against' field, you will often see things like fw_36. This I believe is referring to FISHworks build 36.

Looking at the bugs for the last 30 days, there seems to be extensive & intensive testing happening of the OpenSolaris iScsi target, using various iScsi initiators and various operating systems, like Vista & Linux.

Some of the bugs looks quite obscure, occurring in relatively complex scenarios. Many of the bug comments refer to 'GRITS' and 'Diskomizer'. These seem to be Sun's internals tools for stress testing storage systems. Again, there does not seem to be much public information on these tools, but I googled and found these snippets on GRITS and Diskomizer.

It's good to see that in GRITS and Diskomizer, Sun seems to have the tools that
can expose these bugs. It's also good that these bugs are being fixed quickly.
Sun's Tim Szeto looks to have been particularly busy, and making good progress.

Some of the bug descriptions mention the 'Iwashi NAS appliance'
-So what it that? I would speculate maybe the code name for a Sun appliance?

Wednesday, August 20, 2008

'Java.exe' and Oracle SQL Developer

I've just upgraded to v1.5.1 of 'Oracle SQL Developer'. When you first start the program, it will ask for the path to 'java.exe'. If you need to change or reset that path to a different value, you need to locate text file 'sqldeveloper.conf' and edit the value of setting 'SetJavaHome'. You may want to change this setting if you try to use an 'Unsupported Java Version'.

Friday, August 15, 2008

OpenSolaris and the HP ML-115

The HP ML-115 is a very low cost server, that is ideal for trying out VMware ESX:
"Using the HP ML115 as a Test Server"
"Building a Low Cost (Cheap) VMware ESX Test Server"

So maybe the ML-115 would also be suitable for OpenSolaris.
So I tried it out, but there are some driver issues.
The good news is that these should be fixed very soon.

Problem #1 -If you try to boot the Open Solaris 2008.05 live CD, it fails, with error "Console login service(s) cannot run". See this thread.

The problem is that the ML-115's DVD drive is connected to a SATA port on the nVidia MCP55 chipset, and the nv_sata driver has only support for disk drives and not CD/DVD drives. But now Bug 6595488 is fixed, if you use build 95.

Ok, so with build 95, you should be able to boot from the DVD drive and install OpenSolaris. I tried this with "Solaris Express Community Edition" and was successful. You should then see something like this:

# cfgadm -a | grep sata | grep configured
sata0/0::dsk/c0t0d0 disk connected configured ok
sata1/0::dsk/c4t0d0 disk connected configured ok
sata2/0::dsk/c1t0d0 cd/dvd connected configured ok
# rmformat -l
Looking for devices...
1. Logical Node: /dev/rdsk/c1t0d0p0
Physical Node: /pci@0,0/pci103c,1714@5,2/cdrom@0,0
Connected Device: HL-DT-ST DVD-RAM GH15L FA01
Device Type: DVD Reader/Writer
Size: 2.9 GB
# cdrw -l
Looking for CD devices...
Node Connected Device Device type
----------------------+--------------------------------+-----------------
cdrom0 | HL-DT-ST DVD-RAM GH15L FA01 | CD Reader/Writer
# modinfo | grep nv_sata
45 fffffffff7883000 6458 207 1 nv_sata (Nvidia ck804/mcp55 HBA v1.7)

Problem #2 - X will not run to give a graphical interface. If you check the Xerrors file, you will see:

(EE) MGA(0): Given color and fb depth combination not supported by this driver
(EE) Screen(s) found, but none have a usable configuration.

The graphics chipset on ML-115 is identified as "MGA G200e [Pilot] ServerEngines"
aka "mgag200 SE A PCI" aka "vendor 0x102b device 0x0522" and uses the MGA driver, in file "mga_drv.so". The version of the MGA driver supplier with OpenSolaris is version 1.4.6

I tried booting a Linux LiveCD, GRML v1.1, which uses
version 1.4.7 of the MGA driver and this worked without problem on the ML-115.

The OpenSolaris MGA driver has recently been updated. See Bug 6735095 and that fix should be available in build 97. Let's hope that fixes the X problem.

Problem #3 - No network card detected. The ML-115 network card is identified as
"Broadcom Corporation NetXtreme BCM5722 Gigabit Ethernet PCI Express" aka "vendor 0x14e4 device 0x165a". The fix for this should be available in build 96 - see Bug 6726056.

Updated ISO for OpenSolaris to build 95 are available now. For an update to build 97, I guess that means a wait of a further 4 weeks.