2010
02.10

Failed to shutdown DBConsole Gracefully


Pretty average Oracle RAC Cluster (10g, two nodes). All of a sudden, trying to stop dbconsole results in an error:

[oracle@racnode1 log]$ emctl stop dbconsole
TZ set to Europe/Vatican
Oracle Enterprise Manager 10g Database Control Release 10.2.0.4.0
Copyright (c) 1996, 2007 Oracle Corporation.  All rights reserved.
https://racnode1:1158/em/console/aboutApplication
Stopping Oracle Enterprise Manager 10g Database Control ...
--- Failed to shutdown DBConsole Gracefully ---
 failed.

Similar behaviour when attempting to start it:

[oracle@racnode1 log]$ emctl start dbconsole
TZ set to Europe/Vatican
Oracle Enterprise Manager 10g Database Control Release 10.2.0.4.0
Copyright (c) 1996, 2007 Oracle Corporation.  All rights reserved.
https://racnode1:1158/em/console/aboutApplication
Agent Version     : 10.1.0.6.0
OMS Version       : 10.1.0.6.0
Protocol Version  : 10.1.0.2.0
Agent Home        : /opt/oracle/product/10.2.0/db_1/racnode1_DBSID1
Agent binaries    : /opt/oracle/product/10.2.0/db_1
Agent Process ID  : 24756
Parent Process ID : 24753
Agent URL         : https://racnode1:3938/emd/main
Started at        : 2010-02-09 13:48:34
Started by user   : oracle
Last Reload       : 2010-02-09 13:48:34
Last successful upload                       : (none)
Last attempted upload                        : (none)
Total Megabytes of XML files uploaded so far :     0.00
Number of XML files pending upload           :     3971
Size of XML files pending upload(MB)         :    50.11
Available disk space on upload filesystem    :    59.30%
Agent is already started. Will restart the agent
Stopping agent ... stopped.
Starting Oracle Enterprise Manager 10g Database Control ............................................................................................. failed.
------------------------------------------------------------------
Logs are generated in directory /opt/oracle/product/10.2.0/db_1/racnode1_DBSID1/sysman/log

Peeking into the emdctl.trc logfile, I found something that definitely smelled of expired certificates:

2010-02-09 13:54:14 Thread-4134193952 ERROR http: 6: Unable to initialize ssl connection with server, aborting connection attempt
2010-02-09 13:54:16 Thread-4133477152 WARN  http: snmehl_connect: connect failed to (racnode1:3938): Connection refused (error = 111)
2010-02-09 13:54:41 Thread-4134140704 ERROR ssl: nzos_Handshake failed, ret=29024
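
If you'd rather not eyeball the trace file, the check can be scripted. The following is a minimal Python sketch, assuming the two error messages shown in the excerpt above are the ones worth looking for; other releases may word them differently, and the helper name find_ssl_errors is made up for this example.

```python
import re

# Patterns copied from the trace excerpt; treat them as assumptions
# for any release other than the one shown in this post.
SSL_PATTERNS = (
    re.compile(r"Unable to initialize ssl connection"),
    re.compile(r"nzos_Handshake failed, ret=\d+"),
)


def find_ssl_errors(lines):
    """Return the trace lines that match one of the SSL failure patterns."""
    return [line.rstrip("\n") for line in lines
            if any(p.search(line) for p in SSL_PATTERNS)]


# The three lines quoted from emdctl.trc.
SAMPLE = """\
2010-02-09 13:54:14 Thread-4134193952 ERROR http: 6: Unable to initialize ssl connection with server, aborting connection attempt
2010-02-09 13:54:16 Thread-4133477152 WARN  http: snmehl_connect: connect failed to (racnode1:3938): Connection refused (error = 111)
2010-02-09 13:54:41 Thread-4134140704 ERROR ssl: nzos_Handshake failed, ret=29024
""".splitlines()

for hit in find_ssl_errors(SAMPLE):
    print(hit)
```

To run it against the real file, pass an open file handle instead of SAMPLE.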

This thread on Oracle forums seems to confirm my suspicion.

And here’s what you do to fix the issue:

Ready your environment ($ORACLE_SID, $ORACLE_HOME, …). I “source” a script for that, each instance has its own.

[oracle@racnode1 ~]$ cat envDBSID.sh
export ORACLE_SID=DBSID1
export ORACLE_HOME=/opt/oracle/product/10.2.0/db_1
export PATH=/usr/local/bin:/bin:/usr/bin:/home/oracle/bin:"$ORACLE_HOME"/bin

[oracle@racnode1 ~]$ . envDBSID.sh

If the certificate is expired, dbconsole won’t shut down cleanly. Fetch its PID and kill it manually.

[oracle@racnode1 ~]$ cat /opt/oracle/product/10.2.0/db_1/racnode1_DBSID1/emctl.pid
25608

[oracle@racnode1 ~]$ ps axo pid,command | grep 25608
25608 /opt/oracle/product/10.2.0/db_1/jdk/bin/java -server -Xmx256M -XX [..]

[oracle@racnode1 ~]$ kill 25608

[oracle@racnode1 ~]$ ps axo pid,command | grep 25608
[oracle@racnode1 ~]$
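
The same "read the PID file, check the process, kill it" sequence can be sketched in Python. This is only an illustration of the manual steps above, not something emctl provides; the function names are invented for the example, and it assumes the PID file holds nothing but the number.

```python
import os
import signal


def read_pid(pidfile):
    """Return the PID stored in pidfile as an int."""
    with open(pidfile) as fh:
        return int(fh.read().strip())


def is_alive(pid):
    """True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_from_pidfile(pidfile):
    """Send SIGTERM (what a plain `kill` sends) if the process is running."""
    pid = read_pid(pidfile)
    if is_alive(pid):
        os.kill(pid, signal.SIGTERM)
    return pid
```

In the case above, the call would be stop_from_pidfile("/opt/oracle/product/10.2.0/db_1/racnode1_DBSID1/emctl.pid"), run as the oracle user.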

Run emctl secure dbconsole to generate the new certificates. Provide Oracle SYSMAN’s password and the hostname you’ll use (without domain name, in my case).
The URL displayed (the port number) will also tell you if you’re on the right instance and got the intended environment.

[oracle@racnode1 ~]$ emctl secure dbconsole
TZ set to Europe/Vatican
Oracle Enterprise Manager 10g Database Control Release 10.2.0.4.0
Copyright (c) 1996, 2007 Oracle Corporation.  All rights reserved.
https://racnode1:5500/em/console/aboutApplication
Enter Enterprise Manager Root password :
Enter a Hostname for this OMS : racnode1

DBCONSOLE already stopped...   Done.
Agent is already stopped...   Done.
Securing dbconsole...   Started.
Checking Repository...   Done.
Checking Em Key...   Done.
Checking Repository for an existing Enterprise Manager Root Key...   Done.
Fetching Root Certificate from the Repository...   Done.
Updating HTTPS port in emoms.properties file...   Done.
Generating Java Keystore...   Done.
Securing OMS ...   Done.
Generating Oracle Wallet Password for Agent....   Done.
Generating wallet for Agent ...    Done.
Copying the wallet for agent use...    Done.
Storing agent key in repository...   Done.
Storing agent key for agent ...   Done.
Configuring Agent...
Configuring Agent for HTTPS in DBCONSOLE mode...   Done.
EMD_URL set in /opt/oracle/product/10.2.0/db_1/racnode1_RDS1/sysman/config/emd.properties
   Done.
Configuring Key store..   Done.
Securing dbconsole...   Sucessful.

Try and start dbconsole.

[oracle@racnode1 ~]$ emctl start dbconsole
TZ set to Europe/Vatican
Oracle Enterprise Manager 10g Database Control Release 10.2.0.4.0
Copyright (c) 1996, 2007 Oracle Corporation.  All rights reserved.
https://racnode1:5500/em/console/aboutApplication
Starting Oracle Enterprise Manager 10g Database Control ............................. started.
------------------------------------------------------------------
Logs are generated in directory /opt/oracle/product/10.2.0/db_1/racnode1_RDS1/sysman/log

[oracle@racnode1 ~]$ emctl status dbconsole
TZ set to Europe/Vatican
Oracle Enterprise Manager 10g Database Control Release 10.2.0.4.0
Copyright (c) 1996, 2007 Oracle Corporation.  All rights reserved.
https://racnode1:5500/em/console/aboutApplication
Oracle Enterprise Manager 10g is running.
------------------------------------------------------------------
Logs are generated in directory /opt/oracle/product/10.2.0/db_1/racnode1_RDS1/sysman/log

Be prepared to handle the same situation in a few months. To see the new certificate’s expiry date, open any HTTPS URL served by dbconsole (e.g. https://racnode1:5500/em/console/aboutApplication) and click on the lock icon your web browser should show somewhere.
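
Once you know the expiry date, working out how many days you've got left is easy to script. Here is a small Python sketch, assuming the date is available in the "Mon DD HH:MM:SS YYYY GMT" form (the one `openssl x509 -noout -enddate` prints after "notAfter="). The sample date is made up, not taken from the real certificate.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp.

    not_after: string such as "Feb  9 13:48:34 2011 GMT".
    now: Unix time to compare against (defaults to the current time).
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


# Hypothetical expiry date, for illustration only.
print(round(days_until_expiry("Feb  9 13:48:34 2011 GMT")))
```

A negative result means the certificate has already expired.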

2010
02.02

Remote operations on vmdk disks


One of the local PCs is using VMware Player to run an (old and small) Windows 2000 Virtual Machine.
We need to expand its virtual disk, but VMware Player lacks the tools to do that… I decided to see if it’s possible to perform the expansion without moving the VM to a Server/ESXi/vSphere VMware host, using the commands (vmrun, vmware-vdiskmanager) that come with the VMware Workstation installation I’ve got on my Linux laptop. The VM is hosted on a Windows XP PC.
Here’s what I did:

  • Mount the host C$ share on the Linux laptop, using CIFS.
  • Unexpectedly, the VM has got some active snapshots. Since the whole snapshot machinery is not available in VMware Player, the VM must have come from a VMware Server/ESXi/vSphere host.
    # vmrun -T ws listSnapshots "Windows 2000 Professional.vmx"
    Total snapshots: 2
    Snapshot 1
    Snapshot 2
  • You can’t expand a disk that has snapshots; they’ve got to be removed first:
    # vmrun -T ws deleteSnapshot "Windows 2000 Professional.vmx" "Snapshot 1"

    First caveat: removing a snapshot makes vmrun read and write a large amount of data (how much depends, I guess, on the snapshot’s size). Since you’re running it across the network, you’ll have to wait a while until it’s done. I monitored the process with iftop -B -i eth0 -f 'host hostname' to get an estimate of how long it would take to complete. Each snapshot needs to be removed this way, by means of vmrun.

  • Expansion time:
    # vmware-vdiskmanager -x 25GB "Windows 2000 Professional.vmdk"
      Grow: 100% done.
    Disk expansion completed successfully.

    WARNING: If the virtual disk is partitioned, you must use a third-party
             utility in the virtual machine to expand the size of the
             partitions. For more information, see:
             http://www.vmware.com/support/kb/enduser/std_adp.php?p_faqid=1647

    vmware-vdiskmanager appends X bytes of space (where X = new_capacity – previous_capacity) to the chosen vmdk. Again, those bytes travel through the network, be patient.

  • As vmware-vdiskmanager was so kind to remind us, we just expanded the disk, not the logical partitions contained therein. Some kind of partition manager software is needed to complete the job, my favourite being Acronis Disk Director.
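
To get a feeling for how long the expansion step takes over the network, the "X = new_capacity – previous_capacity" arithmetic can be turned into a quick estimate. A rough Python sketch follows; the previous capacity (10 GB) and the throughput (10 MiB/s) are made-up figures, and it assumes vmware-vdiskmanager's "GB" means 1024^3 bytes.

```python
GIB = 1024 ** 3  # assumption: "GB" on the command line means GiB
MIB = 1024 ** 2


def expansion_bytes(new_capacity_gb, previous_capacity_gb):
    """Bytes appended to the vmdk: new capacity minus previous capacity."""
    return (new_capacity_gb - previous_capacity_gb) * GIB


def transfer_seconds(n_bytes, bytes_per_second):
    """Time needed to push n_bytes through a link of the given throughput."""
    return n_bytes / bytes_per_second


# Hypothetical figures: growing from 10 GB to 25 GB at 10 MiB/s.
appended = expansion_bytes(25, 10)
print(appended, "bytes,", transfer_seconds(appended, 10 * MIB) / 60, "minutes")
```

Feed it the throughput you read off iftop to get a less hypothetical number.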

Summing things up, it’s indeed possible to operate remotely on vmdk disks. Did we save any time by doing things this way, that is, without copying the VM “out” of its host and then moving it back? If there’s more than one snapshot, no doubt that “local beats network”. And what about the expansion? If we add some (empty) data to a vmdk and then need to move it back to the host, local shouldn’t be any faster than network. Better would be to use an external drive of sorts. Better yet is to temporarily copy/install VMware Workstation on the host, or just the vmrun/vmware-vdiskmanager executable files and their dependencies. That’s what I’ll try the next time.

2010
01.26

VMware vSphere Client/vCenter Server Version 4.0.0 Build 208111.

We created a bunch of new LUNs, planning to increase an existing Datastore’s capacity.
On the VMware side, the operation should be a matter of simply firing up vSphere Client, choosing a host, opening the Configuration tab, viewing the Datastore’s properties, then clicking on the Increase button. Except that no Extent device seems to be found. That’s weird because we already did (multiple times) a rescan of each Storage Adapter/HBA. Moreover, selecting “Add Storage”, as if we were to create a new Datastore, does show the expected volumes.
The solution turned out to be this one:

  • Connect vSphere Client directly to the host (thus logging in as root), and not to the vCenter Server.
2010
01.20

From SQL to Excel, with Perl


Quite often I’m asked to pull out some information from a database, process it and produce an Excel report.
Here is a minimal Perl script that carries out the task.

  • Define the column headings and their widths in the @columns array.
  • Handle the command line parameters. There are 5 in the example, assigned to the $p_* variables.
  • Prepare the Excel worksheet, defining cell formatting, …
  • Connect to the database.
  • Prepare the query, substituting the command line parameters.
  • Fetch rows, populate the sheet.
#!/usr/bin/perl
# Giuliano - http://www.108.bz
use strict;
use DBI;
use Spreadsheet::WriteExcel;

use constant C_HEADING => 0;
use constant C_WIDTH   => 1;
my @columns = (
    ['Date',      22 ],
    ['Caller',    20 ],
    ['Called',    20 ],
    ['Connected', 11 ],
    ['Duration',  11 ],
    ['Reason',    24 ],
    ['XferExt',   11 ],
    ['XferName',  22 ]
);

die <<EOM unless @ARGV == 5;
usage:
$0 year month day phonenumber file.xls
EOM
my ($p_year, $p_month, $p_day, $p_phnumber, $p_filename) = @ARGV;

my $workbook        = Spreadsheet::WriteExcel->new($p_filename);
my $sheet           = $workbook->add_worksheet("Data");
my $default_format  = $workbook->add_format(num_format => '@'); $default_format->set_font('Verdana'); $default_format->set_border(1);
my $bold_format     = $workbook->add_format(); $bold_format->set_font('Verdana'); $bold_format->set_bold(); $bold_format->set_border(1);

$sheet->write(0,$_,$columns[$_]->[C_HEADING], $bold_format) for (0..$#columns);
$sheet->set_column($_, $_, $columns[$_]->[C_WIDTH]) for (0..$#columns);

my $dbh = DBI->connect('dbi:Sybase:server=dsnname;database=dbname','username','password') or die "Cannot connect: $DBI::errstr\n";

my $sth = $dbh->prepare(<<EOQ
SELECT IpPbxCDR.StartTime, IpPbxCDR.OriginationNumber, IpPbxCDR.CalledNumber, IpPbxCDR.DestinationNumber, DATEDIFF(ss, IpPbxCDR.StartTime, IpPbxCDR.EndTime) AS Duration, IpPbxCDR.DisconnectReason, IpPbxCDR_1.CalledNumber AS XferExt,
IpPbxCDR_1.CalledName AS XferName
FROM IpPbxCDR LEFT OUTER JOIN
IpPbxCDR AS IpPbxCDR_1 ON IpPbxCDR.TransferredToCallId = IpPbxCDR_1.CallId
WHERE (IpPbxCDR.CalledNumber LIKE '$p_phnumber') AND
(MONTH(IpPbxCDR.StartTime) = $p_month) AND
(YEAR(IpPbxCDR.StartTime) = $p_year) AND
(DAY(IpPbxCDR.StartTime) = $p_day)
ORDER BY IpPbxCDR.StartTime
EOQ
);

$sth->execute();

my $i = 1;
my $row;
while ( $row = $sth->fetchrow_arrayref ) {
    $sheet->write_string($i,$_,$row->[$_], $default_format) for (0..$#$row);
    $i++;
}

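$sheet->activate();

# Release the database handle and write the workbook out explicitly,
# rather than relying on object destruction at exit.
$dbh->disconnect();
$workbook->close();

exit;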

Actually, the example does something useful. It connects to a Swyx Call Detail Record database, selecting phone calls placed to a given number on a given day. The generated report also contains call duration and transfer status/destination, if any. Here’s what it looks like (some data has been obfuscated, to protect the innocent):

[Screenshot: the generated Excel report]

And here’s the command that produces it:

./callreport.pl 2010 1 19 '+39%10123123' x.xls
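
One thing the script takes on trust is its command line: the four query parameters are pasted into the SQL text as they are. Below is a sketch of a validation step, written in Python for brevity rather than Perl. The function name and the set of characters allowed in the phone number (digits, "+", and the LIKE wildcards "%" and "_") are assumptions based on the sample invocation above.

```python
import datetime
import re


def validate_report_args(year, month, day, phnumber):
    """Return (year, month, day, phnumber) or raise ValueError.

    Numbers must form a real calendar date; the phone pattern may only
    contain digits, '+', and the SQL LIKE wildcards '%' and '_'.
    """
    y, m, d = int(year), int(month), int(day)
    datetime.date(y, m, d)  # raises ValueError on impossible dates
    if not re.fullmatch(r"[0-9+%_]+", phnumber):
        raise ValueError("unexpected characters in phone number pattern")
    return y, m, d, phnumber


print(validate_report_args("2010", "1", "19", "+39%10123123"))
```

Anything containing a quote character is rejected before it can reach the query.
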
2010
01.19

Scenario:

  • Headquarters (HQ) connected (MPLS VPN) to some branch sites.
  • In some of the branches, a Check Point UTM-1 Edge X (SofaWare) sits between the wireless and wired networks, enforcing security policies between them.
  • The two networks are bridged together (Layer 2) by the firewall.
  • The wireless LAN is used by some kind of next gen Barcode Scanner: an embedded device with Windows CE .NET 4.2, also able to act as a Terminal Services client.

Customer wants to install some software on the scanners, downloading it from a shared folder residing on one of HQ servers. I add the necessary (and temporary) rules on the firewalls, but the folder still cannot be reached. Windows CE complains that “The network path was not found” but the rules look good.

Luckily, the Edge firewalls provide a packet sniffer, allowing us to further investigate the issue. Just connect to the web based interface of UTM-1/SofaWare, go to Setup > Tools > Sniffer, choose a filter string (using the familiar libpcap/tcpdump syntax), select the interface (“bridge”, in my case), and you’re set. Captured packets can then be downloaded to your PC and opened up in Wireshark.

We came up with a bunch of peculiar NetBIOS Name query requests/answers:

$ tshark -r sniffer4.cap
  1   0.000000    192.168.2.3  -> 192.168.1.10  NBNS Name query NB HQSERVER<20>
  2   0.028436    192.168.1.10 -> 192.168.2.3   NBNS Name query response
  3   1.001397    192.168.2.3  -> 192.168.2.255 NBNS Name query NB HQSERVER<20>
  4   1.251460    192.168.2.3  -> 192.168.2.255 NBNS Name query NB HQSERVER<20>
  5   1.502820    192.168.2.3  -> 192.168.2.255 NBNS Name query NB HQSERVER<20>

Some hostnames, for clarity:

  192.168.2.3  : BOSCANNER
  192.168.1.10 : HQDC1
  192.168.1.20 : HQSERVER

The Barcode Scanner (Client) asks one of the DNS/Domain Controllers in HQ if it is called HQSERVER. But HQSERVER is the server we’re trying to connect to from the Scanner (by means of \\HQSERVER\sharename)! Why in the world should the device directly ask HQDC1 if it is called HQSERVER? Using a unicast NetBIOS query, too? Obviously HQDC1 answers “no, it’s not me” (Requested name does not exist)… The Scanner then broadcasts the same query to its local network segment, but since HQSERVER sits in Headquarters, it gets no answer and generates the error “The network path was not found”.
Turns out that \\192.168.1.20\sharename causes the same dialogue, with a NetBIOS name query that seemingly asks for a server named “192.168.1.20”. It’s as if, in Windows CE, UNC paths could only use names, not IP addresses.
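
For reference, here is what a query for “HQSERVER<20>” carries on the wire. NetBIOS names are padded with spaces to 15 characters, followed by a one-byte suffix (0x20 is the file server service), and then each byte is split into two nibbles, each mapped to a letter starting at “A” (the first-level encoding from RFC 1001). The Python sketch below implements just that encoding; the function name is made up.

```python
def encode_netbios_name(name, suffix=0x20):
    """First-level encoding (RFC 1001) of a NetBIOS name plus suffix byte."""
    if len(name) > 15:
        raise ValueError("NetBIOS names are at most 15 characters long")
    # Pad to 15 characters with spaces, then append the suffix byte.
    raw = name.upper().ljust(15).encode("ascii") + bytes([suffix])
    # Each byte becomes two letters: 'A' + high nibble, 'A' + low nibble.
    return "".join(
        chr(ord("A") + (b >> 4)) + chr(ord("A") + (b & 0x0F)) for b in raw
    )


print(encode_netbios_name("HQSERVER"))
print(encode_netbios_name("192.168.1.20"))
```

Note that “192.168.1.20” is 12 characters long, so it fits into a NetBIOS name just as well as a hostname does, which is consistent with the query seen in the capture.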

Well, the Customer didn’t have enough time for me to properly solve/understand the issue but we worked around it by:

  • Assigning a static IP to the Windows CE device.
  • Setting 192.168.1.20 (HQSERVER, where the shared folder is hosted) as DNS and WINS server in the TCP/IP settings of Windows CE.
  • Copying the needed files from the network share and reverting back to DHCP.

Step two makes the Client send NetBIOS name queries to HQSERVER instead of HQDC1. This allows shared folder access to work.

2010
01.19

Unknown devices on IBM servers


When installing Windows on IBM servers without using IBM ServerGuide, you’ll sometimes end up with two unknown devices in Device Manager:

ASF Table
ACPI\ASF0001\2&DABA3FF&0

and:

IBM Active PCI Device
ACPI\IBM37D4\2&DABA3FF&0

To deal with the first one, see document MIGR-43764 and download the driver mentioned there (it’s called “25k9219.zip”).
The latter can be fixed by installing the “IBM Active PCI Software”; you can find it on your server’s support page, e.g. here (“90p4169.exe”).

Also, document MIGR-51940, Installing Microsoft Windows Server 2003 version 1.0 – Servers, proves useful.

And a last bit: if you’re in a hurry or haven’t got the CD handy, ServeRAID Manager Software can be installed by simply copy/pasting its folder from another server. It usually works. 🙂

2010
01.08

On MS Windows operating systems, many processes run under the NT AUTHORITY\SYSTEM account, be they scheduled tasks or services.
Sometimes it’s useful to run cmd.exe as the SYSTEM user and see what’s going on. Here’s a nifty trick to do it.

C:\Documents and Settings\giuliano>time /t
17:10

C:\Documents and Settings\giuliano>at 17:11 /interactive cmd.exe
Added a new job with job ID = 1

C:\Documents and Settings\giuliano>

Basically you check what time it is and schedule cmd.exe to run on the next minute. You do that by means of the at.exe OS command.
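
The “next minute” bit is trivial, but here it is spelled out as a small Python sketch that builds the at command line from a given time. The function name is invented; the command syntax is the one used above.

```python
from datetime import datetime, timedelta


def at_command_for_next_minute(now, program="cmd.exe"):
    """Build the `at` command line that starts program on the next minute."""
    run_at = now.replace(second=0, microsecond=0) + timedelta(minutes=1)
    return "at {:%H:%M} /interactive {}".format(run_at, program)


print(at_command_for_next_minute(datetime.now()))
```

If the current minute is about to end, the job might be registered too late to fire, so it’s safer to add two minutes in that case.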

When the time comes, a Command Prompt window should pop-up. It runs under the SYSTEM account:

Microsoft Windows [Version 5.2.3790]
(C) Copyright 1985-2003 Microsoft Corp.

C:\WINDOWS\system32>whoami
nt authority\system

C:\WINDOWS\system32>

Each process you run from there also runs as SYSTEM. If you run regedit.exe, for instance, you can import registry data into the SYSTEM user’s hive. Today I used this technique to export/import PuTTY’s settings (they are stored in the registry) in order to make plink.exe, as run from a UPS monitoring Agent, see a pre-configured SSH “Session” (hostname, login username, private key, …). I needed the Agent to shut down a bunch of Linux servers when the battery charge was running low: plink.exe on the Windows side and sudo on the Linux one did the job.

For completeness’ sake, here’s a post on the same subject. It also deals with Vista/Windows Server 2008 and how to achieve our goal using PsExec.

2010
01.07

When setting up High Availability on FortiGate, one thing struck me as a bit unusual. Unlike other firewall clustering solutions (correct me if I’m wrong), FortiGate devices don’t make you assign both physical and logical IP addresses to interfaces. You are supposed to configure logical IP addresses only. This implies that you can’t directly access a specific node/firewall in your cluster. You have to SSH into the Master unit and, from there, log into the Subordinate one(s). Here are the relevant CLI commands:

FW-NODE-A # get system ha status
Model: 100
Mode: a-p
Group: 0
Debug: 0
ses_pickup: disable
Master:129 FW-NODE-A      FG100C3000000000 0
Slave :128 FW-NODE-B      FG100C3000000001 1
number of vcluster: 1
vcluster 1: work 169.254.0.1
Master:0 FG100C3000000000
Slave :1 FG100C3000000001

FW-NODE-A # execute ha manage ?
please input peer box index.
<1>     Subsidary unit FG100C3000000001

FW-NODE-A # execute ha manage 1

FW-NODE-B $
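
If you have more than a couple of clusters to look after, the member list can be pulled out of the “get system ha status” output with a few lines of Python. This sketch is based only on the output shown above. The meaning I give to the columns (priority after the colon, peer index at the end of the line) is an assumption, and other FortiOS releases may format things differently.

```python
import re

# Role, number after the colon (assumed: priority), hostname, serial,
# trailing number (assumed: the index accepted by "execute ha manage").
MEMBER_RE = re.compile(r"^(Master|Slave)\s*:(\d+)\s+(\S+)\s+(\S+)\s+(\d+)$")


def parse_ha_members(output):
    """Return one dict per cluster member found in the status output."""
    members = []
    for line in output.splitlines():
        m = MEMBER_RE.match(line.strip())
        if m:
            role, priority, hostname, serial, index = m.groups()
            members.append({"role": role, "priority": int(priority),
                            "hostname": hostname, "serial": serial,
                            "index": int(index)})
    return members


SAMPLE = """\
Mode: a-p
Master:129 FW-NODE-A      FG100C3000000000 0
Slave :128 FW-NODE-B      FG100C3000000001 1
number of vcluster: 1
vcluster 1: work 169.254.0.1
Master:0 FG100C3000000000
Slave :1 FG100C3000000001
"""

for member in parse_ha_members(SAMPLE):
    print(member)
```

The shorter Master/Slave lines of the vcluster section are skipped on purpose, since they don’t carry a hostname.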

I wonder what would happen if the Master unit were to hang, that is, get stuck in a state where neither the failover mechanism nor SSH/HTTPS access works. How could you remotely force a failover to another node? In such a scenario, is a physical power cycle of the Master unit the only option?

2009
12.29

By connecting (SSH, “admin” user) to a Symantec Brightmail Gateway appliance [1], you are left in a restricted shell where only a limited set of commands is available. The undocumented “set-support” command may come in handy: it assigns a temporary password to the “support” user, a normal Unix account with a standard shell.

giuliano@balrog ~ $ ssh admin@192.1.2.3
admin@192.1.2.3's password:

bmail> set-support
Warning: Do NOT execute this script without explicit direction from a Symantec
Customer Support person.
Changing password for user support.
New password:
BAD PASSWORD: it is based on a dictionary word
Retype new password:
passwd: all authentication tokens updated successfully.
User support enabled until 01/04/2010.
bmail> logout
Connection to 192.1.2.3 closed.

giuliano@balrog ~ $ ssh support@192.1.2.3
support@192.1.2.3's password:

[support@bmail support]$ echo $SHELL
/bin/bash

What’s nice about the “support” user is that he can run tcpdump and access useful logfiles, e.g.:

[support@bmail support]$ tail -f /data/logs/stats.maillog
2007 Mar 30 11:36:17 (info) delivery-mta/smtp[2008]: 45A689A9: to=, relay=192.1.2.3[192.1.2.3], delay=0, status=sent (250 OK)

A note about the restricted shell command “watch maillog” and the “/data/logs/stats.maillog” file.
The latter is the truly useful MTA log file (holding a realtime record of which messages are relayed through the appliance), while the “watch maillog” command shows entirely different stuff. There used to be a proper “watch stats.maillog” command, but at some point Symantec decided to remove it; I can’t really tell why.

I originally learnt about the “set-support” command here (Symantec Support Forums).

If you need full root access, you can restart the appliance, break into GRUB’s command line interface, append a “1” to the kernel parameters in order to boot to runlevel 1 (single user mode). There you can change the root password to whatever you like and make Symantec’s Tech Support upset. 🙂 I had to do it a couple of times to replace a failed disk, though…

  [1] Or Symantec Mail Security, as it was previously called. It’s an antispam device, coming in either hardware or virtual (VMware) appliance versions. Models I’ve seen: 8240, 8260. Almost “install and forget”, if you ask me. That means it works quite well! 😉

2009
12.27

FortiGate/Cisco Layer 2 woes


The other day I swapped a firewall with a different one, a FortiGate 60B. After having re-created the config, everything seemed to be functional but: Internet browsing “felt” a bit sluggish (I was on a 20Mbps uplink) and, here comes the weirdness, when I did “something” the whole WAN connectivity would just hang for a couple of minutes. The issue was reproducible by trying to connect via Remote Desktop to one of the published servers (by tunneling through my Employer’s Office, and bouncing back on the Customer’s firewall) or even by opening my Flickr page (but then the cause could’ve been the poor quality of the pictures therein 😉 ).
At first, I suspected a dreadful MTU issue: maybe the firewall/router, or something along the road, was choking when fragmenting or reassembling packets. But a “ping outside_host -s 1472 -M do” (or “ping -f -l 1472 outside_host”, on Windows) proved that 1500-byte packets (1472 bytes of ICMP payload, plus 8 bytes of ICMP header and 20 bytes of IP header) could indeed flow out and back without being fragmented: the issue seemed totally random.
Besides that, even lowering the MTU on my PC wouldn’t change anything.
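
The numbers used in the ping test above come from simple header arithmetic, sketched here in Python. It assumes plain IPv4 without options (20-byte header) and the standard 8-byte ICMP echo header; the function names are made up.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request/reply header


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_commands(host, mtu=1500):
    """Don't-fragment ping command lines for Linux and Windows."""
    size = max_ping_payload(mtu)
    return {
        "linux": "ping {} -s {} -M do".format(host, size),
        "windows": "ping -f -l {} {}".format(size, host),
    }


print(max_ping_payload(1500))
print(ping_commands("outside_host"))
```

Lower the mtu argument step by step to find the largest size that still gets through when a path really does have an MTU problem.
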
After much cursing, I tried to see if anything was going on at L2 level. Firewall and router (Cisco, owned by the ISP, not accessible to me) were connected together by a crossover cable.
The relevant FortiOS CLI command is the following:

FIREWALLNAME # diagnose hardware deviceinfo nic wan1
System_Device_Name              wan1
Link                            up
Speed                           100 Mbps full duplex
FlowControl                     Tx off, Rxoff
MTU_Size                        1500

My firewall (the above example comes from another one) was negotiating 100Mbps, Half Duplex. Nothing wrong with that in itself, but when I tried to force speed and duplex on the FortiGate, the Ethernet link would not come up. So auto-negotiation was mandatory, and I had no way to change that on the router.
At some point, when Internet connectivity was stuck, it seemed to me that unplugging and plugging back in the cable between firewall/router, would allow for a faster recovery. Definitely, something was wrong at L2.
The solution was to insert a 15€ DLink switch between firewall and router. No problems since then: it really looks like FortiGate and Cisco NICs don’t play well together, at least under those conditions. The Customer will call the ISP to have the settings tweaked on the Cisco side and see if they can get rid of the switch.
The proper way to diagnose the problem would’ve been to ping the router from the outside during a connectivity stop. Since the issue was “local”, the router should answer while no traffic should pass from the firewall to the router.