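I recently set up a two-node Heartbeat (HB) 1.0 cluster, and I am writing down the rough procedure here so that I don't forget it.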

Environment:
[php]
Two Red Flag DC5 SP4 64-bit systems
Floating IP 192.168.6.3
ha1 IP 192.168.6.1 (eth0) 10.10.1.11 (eth1)
ha2 IP 192.168.6.2 (eth0) 10.10.1.12 (eth1)
Gateway 192.168.6.254
One simulated 20 GB storage volume
Oracle 10.2.0.1
Heartbeat 1.0
[/php]

Procedure:
1. Install the operating system; this is not covered in detail here.

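2. On another system, use the iSCSI service to present a simulated 20 GB volume to both servers, and verify that both of them recognise it correctly. How to configure the iSCSI service is covered in other articles on this site, so it is not detailed here.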
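One way to confirm that a node sees the iSCSI volume is to look for the disk in the kernel's partition table. The following is a minimal sketch, assuming the volume shows up as `sdb` (the resource script later in this post mounts `/dev/sdb1`). The function name `disk_visible` and its optional file argument are made up for illustration.

```shell
#!/bin/sh
# Succeeds if the named block device is listed in a partitions table.
# $1: device name without the /dev/ prefix, e.g. sdb
# $2: file to read (defaults to /proc/partitions; overridable for testing)
disk_visible() {
    dev=$1
    table=${2:-/proc/partitions}
    # The first two lines are a header; the device name is the 4th column.
    awk -v d="$dev" 'NR > 2 && $4 == d { found = 1 } END { exit !found }' "$table"
}
```

Running `disk_visible sdb && echo "sdb is visible"` on both nodes would be a quick check before going further.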

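3. Install Oracle 10.2.0.1. The installation is fairly straightforward, and anything unclear can be resolved with Google or Baidu, so it is not detailed here either. The values used in the rest of this post are:
$ORACLE_SID orcl
$ORACLE_HOME /opt/app/oracle/product/10.2.0/db_1
Password of the Oracle system user: asdasd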

4. Install mon-0.99.2.

5. Edit the hosts file to map the host names to their IP addresses:
[php]
[root@dc5ha2 ~]# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost localhost.localdomain localhost
192.168.6.1 dc5ha1
10.10.1.11 dc5ha1
192.168.6.2 dc5ha2
10.10.1.12 dc5ha2
[/php]
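Each host name appears twice in this file, once per network. A lookup through /etc/hosts typically returns the first matching line, so the order of the entries matters. The sketch below prints the first address listed for a name; `first_ip_for` is a hypothetical helper, and it ignores trailing comments on a line.

```shell
#!/bin/sh
# Prints the first address listed for a host name in a hosts-format file.
# $1: host name to look up
# $2: file to read (defaults to /etc/hosts; overridable for testing)
first_ip_for() {
    name=$1
    hosts=${2:-/etc/hosts}
    awk -v n="$name" '
        /^[ \t]*#/ { next }    # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
    ' "$hosts"
}
```

With the file shown above, `first_ip_for dc5ha1` would print the eth0 address, because that line comes first.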

6. In the /etc/ha.d directory, configure the authkeys and ha.cf files:
[php]
[root@dc5ha2 ha.d]# cat authkeys
auth 1
1 sha1 rf
[root@dc5ha2 ha.d]# cat ha.cf
bcast eth1 eth0
keepalive 2
deadtime 30
udpport 694
auto_failback off
node dc5ha1
node dc5ha2
logfile /var/log/ha-log
[/php]
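Heartbeat refuses to start unless authkeys is readable by its owner only (mode 600), so it is worth setting the permissions right after creating the file. A minimal sketch follows; the helper name `secure_authkeys` is an assumption made for illustration.

```shell
#!/bin/sh
# Restricts an authkeys file to mode 600 and verifies the result.
# $1: path to the key file (defaults to /etc/ha.d/authkeys)
secure_authkeys() {
    keyfile=${1:-/etc/ha.d/authkeys}
    chmod 600 "$keyfile" || return 1
    # The first ten characters of "ls -l" are the file type and mode bits.
    mode=$(ls -l "$keyfile" | cut -c1-10)
    [ "$mode" = "-rw-------" ]
}
```

Run `secure_authkeys` as root on both nodes. The same authkeys content has to be present on each of them.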

7. Edit the /etc/ha.d/haresources file:
[php]
[root@dc5ha2 ha.d]# cat haresources
dc5ha1 192.168.6.3 oracle mon
[/php]
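The first field of a haresources line is the preferred node, and the remaining fields are the resources. Heartbeat starts them from left to right and stops them in the reverse order, so here the floating IP comes up first and mon goes down first. The sketch below only illustrates that ordering; `stop_order` is a hypothetical helper and not part of Heartbeat.

```shell
#!/bin/sh
# Prints the resources of a haresources line in the order they are stopped,
# i.e. the reverse of the order in which they are listed.
# $1: one haresources line, e.g. "dc5ha1 192.168.6.3 oracle mon"
stop_order() {
    set -- $1    # split the line into fields
    shift        # drop the preferred node name
    out=""
    for r in "$@"; do
        out="$r${out:+ $out}"    # prepend each resource
    done
    echo "$out"
}
```

For the line above, `stop_order "dc5ha1 192.168.6.3 oracle mon"` prints `mon oracle 192.168.6.3`.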

8. In the /etc/ha.d/resource.d directory, edit the oracle file. This file is the start/stop script for the Oracle resource, for example:
[css collapse=”true”]
#!/bin/sh
# $Id: S30oracle,v 1.1 2000/09/27 14:40:55 ar Exp $
#
# Script: $RCSfile: S30oracle,v $
#
# Description: Example Oracle control script
# (Based on dbora example in Oracle docs)
# Requires: dbstart, dbshut from Oracle, oratab file
# Modify as required
#
# Platform: Unix
#
# Author: Ade Rixon – RSi Solutions Ltd
#

#
# Oracle configuration (EDIT)
#

#
# standard variables and functions
#

script="`basename $0`"

#
# args: <start|stop> [no_of_attempts]
#
state=$1 # starting or stopping?

#
ORACLE_OWNER=oracle
export ORACLE_OWNER
ORACLE_HOME=/opt/app/oracle/product/10.2.0/db_1
ORACLE_SID=orcl

#
# decide action based on first argument
#
case "${state}" in

'start')
rm -f /tmp/oramon-oracle-orcl.pid

mount /dev/sdb1 /oradata
if [ $? -ne 0 ]
then
sleep 1
mount /dev/sdb1 /oradata

fi

su - "$ORACLE_OWNER" -c "/home/oramon/oramon/bin/Ora_start $ORACLE_HOME $ORACLE_SID"
;;

'stop')
su - "$ORACLE_OWNER" -c "/home/oramon/oramon/bin/Ora_stop $ORACLE_HOME $ORACLE_SID"
umount /dev/sdb1

if [ $? -ne 0 ]
then
sleep 1
if df | grep "sdb1"
then
fuser -s -m -k /dev/sdb1 >/dev/null 2>&1
else
exit 0
fi

sleep 3

if df | grep "sdb1"
then
fuser -s -m -k /dev/sdb1 >/dev/null 2>&1
else
exit 0
fi
sleep 1
umount /dev/sdb1

fi
exit 0
;;

*)
echo "Usage: $0 <start|stop>"
exit 1 # warning code
;;

esac

exit 0 # OK code (default)
[/css]
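The stop branch uses `df | grep "sdb1"` to decide whether the volume is still mounted. That pattern also matches names such as `sdb10`, and it prints the matching line. A stricter check can compare the device field exactly, as in this sketch. The helper name `is_mounted` and the use of /proc/mounts are assumptions, not part of the original script.

```shell
#!/bin/sh
# Succeeds if the given device appears as a mount source.
# $1: device path, e.g. /dev/sdb1
# $2: mounts table to read (defaults to /proc/mounts; overridable for testing)
is_mounted() {
    dev=$1
    mounts=${2:-/proc/mounts}
    # The mount source is the first column; require an exact match.
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$mounts"
}
```

In the script, `if is_mounted /dev/sdb1; then ... fi` could stand in for the two `df | grep` tests.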

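Both start and stop call commands under /home/oramon/oramon/bin, and they are run as the oracle user, so the ownership of that directory needs to be changed as follows: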
[php]
[root@dc5ha1 /]# cd /home
[root@dc5ha1 home]# chown -R oracle:oinstall oramon/
[/php]
While we are here, let's look at the start and stop scripts that are called:
[css collapse=”true”]
[root@dc5ha1 bin]# cat Ora_start
#!/bin/sh
ORACLE_HOME=$1
ORACLE_SID=$2
#ORACLE_HOME=/home/oracle/oracle/product/10.2.0/db_1
export ORACLE_HOME
#ORACLE_SID=oracle
export ORACLE_SID

lsnrctl start
#$ORACLE_HOME/bin/sqlplus <<!
#dbstart
$ORACLE_HOME/bin/sqlplus '/ as sysdba' <<!
startup
exit
!
[root@dc5ha1 bin]# cat Ora_stop
#!/bin/sh
ORACLE_HOME=$1
ORACLE_SID=$2
#ORACLE_HOME=/home/oracle/oracle/product/10.2.0/db_1
export ORACLE_HOME
#ORACLE_SID=oracle
export ORACLE_SID

lsnrctl stop
#$ORACLE_HOME/bin/sqlplus <<!
$ORACLE_HOME/bin/sqlplus '/ as sysdba' <<!
shutdown immediate
exit
!
[/css]
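Both scripts feed commands to sqlplus through a here-document and do not look at the result. By default sqlplus exits with status 0 even when a statement raises an ORA- error, so a failed `startup` can go unnoticed. One hedged way to catch this is to capture the output and scan it for an error code. The helper `has_ora_error` below is hypothetical and is not part of the original oramon scripts.

```shell
#!/bin/sh
# Succeeds if the captured sqlplus output contains an ORA-nnnnn error code.
# $1: text captured from a sqlplus run
has_ora_error() {
    printf '%s\n' "$1" | grep -Eq 'ORA-[0-9]{5}'
}
```

A start script could capture the sqlplus output in a variable, for example `out`, and then run `has_ora_error "$out" && exit 1`, so that the resource script sees the failure.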

We can start and stop it by hand to check that the script works:
[css collapse=”true”]
[root@dc5ha1 /]# cd /etc/ha.d/resource.d/
[root@dc5ha1 resource.d]# ./oracle start

LSNRCTL for Linux: Version 10.2.0.1.0 - Production on 23-6月 -2011 14:28:22

Copyright (c) 1991, 2005, Oracle. All rights reserved.

Starting /opt/app/oracle/product/10.2.0/db_1/bin/tnslsnr: please wait…

TNSLSNR for Linux: Version 10.2.0.1.0 – Production
System parameter file is /opt/app/oracle/product/10.2.0/db_1/network/admin/listener.ora
Log messages written to /opt/app/oracle/product/10.2.0/db_1/network/log/listener.log
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=dc5ha1)(PORT=1521)))

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1)))
STATUS of the LISTENER
————————
Alias LISTENER
Version TNSLSNR for Linux: Version 10.2.0.1.0 – Production
Start Date 23-6月 -2011 14:28:24
Uptime 0 days 0 hr. 0 min. 0 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /opt/app/oracle/product/10.2.0/db_1/network/admin/listener.ora
Listener Log File /opt/app/oracle/product/10.2.0/db_1/network/log/listener.log
Listening Endpoints Summary…
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=dc5ha1)(PORT=1521)))
Services Summary…
Service "PLSExtProc" has 1 instance(s).
Instance "PLSExtProc", status UNKNOWN, has 1 handler(s) for this service…
The command completed successfully

SQL*Plus: Release 10.2.0.1.0 - Production on 星期四 6月 23 14:28:24 2011

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to an idle instance.

SQL> ORACLE instance started.

Total System Global Area 276824064 bytes
Fixed Size 2020160 bytes
Variable Size 142609600 bytes
Database Buffers 130023424 bytes
Redo Buffers 2170880 bytes
Database mounted.
Database opened.
SQL> Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – 64bit Production
With the Partitioning, OLAP and Data Mining options
[root@dc5ha1 resource.d]# echo $?
0
[root@dc5ha1 resource.d]#
[root@dc5ha1 resource.d]# ./oracle stop

LSNRCTL for Linux: Version 10.2.0.1.0 - Production on 23-6月 -2011 14:30:19

Copyright (c) 1991, 2005, Oracle. All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1)))
The command completed successfully

SQL*Plus: Release 10.2.0.1.0 - Production on 星期四 6月 23 14:30:29 2011

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – 64bit Production
With the Partitioning, OLAP and Data Mining options

SQL> Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – 64bit Production
With the Partitioning, OLAP and Data Mining options
[root@dc5ha1 resource.d]# echo $?
0
[root@dc5ha1 resource.d]#
[/css]

9. In the /usr/lib/mon/mon.d directory, edit the Oracle monitor file (oracle.monitor), for example:
[css collapse=”true”]
#!/bin/sh
# $Id: oramon,v 1.4 2000/04/05 16:45:05 ar Exp $
#
# Script: oramon.sh
#
# Description: Oracle test script
#
# Platform: Unix
#
# Authors: Vincent Sanders – RSi
#

ORACLE_BASH=/home/oracle
. /home/oracle/.bash_profile
#. /home/oracle/.bash_profile

# defaults
# retry count
RETRYS=3
# location of sql query files
QUERY_HOME=/home/oramon/oramon/lib
# enable or disable the checks (1/0)
# you must enable at least one
LISTENERCHECK=1
SYSTIMECHECK=1
DBFREECHECK=1
WRITEREADCHECK=0

script="`basename $0`"

#if [ $# -ne 3 ]; then
# echo "Usage: ${script} <username> <password>" >&1
# exit
#fi

#ORACLE_SID=$1
#ORACLE_USER=$2
#ORACLE_PASSWD=$3
ORACLE_SID="orcl"
ORACLE_USER="system"
ORACLE_PASSWD="asdasd"
# check for required environment vars
if [ "${ORACLE_HOME:-}" = "" -o "${ORACLE_SID:-}" = "" ]; then
echo "Oracle environment not set, aborting" >&1
exit 1
fi
if [ ! -d "${ORACLE_HOME}" ]; then
echo "${ORACLE_HOME} is not a directory, aborting" >&1
exit 1
fi

# ————– no more vars to edit below here ————

# ——————– Query Functions —————
run_sqlplus_query_retry()
{
# $1 script name
# $2 retry count
loop_count=1
while [ ${loop_count} -lt $2 ]; do

run_sqlplus_query "$1"

if [ $? -eq 1 ]; then
# db error – retry
loop_count=`expr $loop_count + 1`
else
return 0
fi
done
# every attempt failed
return 1
}

run_sqlplus_query()
{
#$1 is sql plus script name
result=`su - oracle -c "/home/oramon/oramon/bin/Ora_sqlplus $ORACLE_SID @${QUERY_HOME}/$1 ${ORACLE_USER} ${ORACLE_PASSWD}"`
exit_code=$?

#0 is success
#1 is failure
#10 db not available (wrong db name?)
#249 is bad login
#136 invalid column name

# — Debug bits —
#echo $exit_code
#echo "$result"

if [ ${exit_code} -eq 0 ]; then
return 0
else
return 1
fi

}

run_pl_sql_query_retry()
{
# $1 script name
# $2 retry count
loop_count=0
while [ ${loop_count} -le $2 ]; do

run_pl_sql_query "$1"

if [ $? -eq 1 ]; then
# db error – retry
loop_count=`expr $loop_count + 1`
else
return 0
fi
done
# every attempt failed
return 1
}

run_pl_sql_query()
{
#$1 is pl/sql script name
# a sqlplus script is used to execute all the pl/sql stuff so we can
# get connect errors etc back
result=`su - oracle -c "/home/oramon/oramon/bin/Ora_sqlplus_rw $ORACLE_SID @${QUERY_HOME}/runplsql.sql ${ORACLE_USER} ${ORACLE_PASSWD} $1"`

exit_code=$?
# 0 is success – HOWEVER
# pl/sql may return errors but it only prints them to stdout. We have
# to check the return string for the ERROR keyword (ORA-20000 is user
# errors generated by script "raise_application_error" commands)
# 1 is failure
# 10 db not available (wrong db name?)
# 249 is bad login
# 136 invalid column name

# — DEBUG —
#echo $exit_code
#echo "$result"

if [ ${exit_code} -eq 0 ]; then
return `expr "${result}" : ".*ERROR.*"`
else
return 1
fi

}

run_lsn_check()
{
result=`su - oracle -c "/home/oramon/oramon/bin/lsn_check $ORACLE_SID"`

exit_code=$?
echo "exit_code:$exit_code"
if [ ${exit_code} -eq 0 ]; then
success=`expr "${result}" : "success"`
#echo ${success}
return ${success}
else
return 0
fi

}

# ———————– Main Script ———————

TIMEQUERY=systime.sql
FREEQUERY=dbfree.sql
WRQUERY=writeread.sql

LD_LIBRARY_PATH=${ORACLE_HOME}/lib:/lib:${LD_LIBRARY_PATH}
PATH=${ORACLE_HOME}/bin:${ORACLE_HOME}/JRE/bin:${PATH}

export LD_LIBRARY_PATH

result=""
exit_code=0
val1=""
val2=""

# ———– the Listener check ——-
if [ ${LISTENERCHECK} = 1 ]; then
run_lsn_check ${ORACLE_SID}
if [ $? -eq 1 ]; then
# db connect/retrieve fault – Error
echo "Listener failed"
exit 1
else
echo "Listener success"

fi
fi
# ———– first check if database systime is changing —-
if [ ${SYSTIMECHECK} = 1 ]; then
run_sqlplus_query_retry ${TIMEQUERY} ${RETRYS}
if [ $? -eq 1 ]; then
# db connect/retrieve fault – Error
exit 1
else
val1=${result}
fi

# delay by at least 1 second to give the db time a chance to change
sleep 1

run_sqlplus_query_retry ${TIMEQUERY} ${RETRYS}
if [ $? -eq 1 ]; then
# db connect/retrieve fault – Error
exit 1
else
val2=${result}
fi

if [ "${val1}" != "${val2}" ]; then
# we can connect and query the systime and its updating – dbs ok
echo "connection is ok"
exit 0
else
echo "connection failed"
fi
fi

# —————– then check database free space ————-
# see if dbs busy – hence didnt update systime
if [ ${DBFREECHECK} = 1 ]; then
run_sqlplus_query_retry ${FREEQUERY} ${RETRYS}
if [ $? -eq 1 ]; then
# db connect/retrieve fault – Error
exit 1
else
val1=${result}
fi

# delay by at least 1 second to give the db a chance to change size
sleep 1

run_sqlplus_query_retry ${FREEQUERY} ${RETRYS}
if [ $? -eq 1 ]; then
# db connect/retrieve fault – Error
exit 1
else
val2=${result}
fi

if [ "${val1}" != "${val2}" ]; then
# we can connect and query the db free space and its changing – dbs ok
exit 0
fi
fi

# ———– the ultimate test – write , read , delete cycle ——-
if [ ${WRITEREADCHECK} = 1 ]; then
run_pl_sql_query ${WRQUERY}
if [ $? -eq 1 ]; then
# db connect/retrieve fault – Error
exit 1
else
exit 0
fi
fi

# all enabled tests failed – database is dead
exit 1
[/css]
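The retry helpers in the monitor script loop over a query until it succeeds. A compact, generic form of that pattern, which reports failure once the attempts are used up, might look like the sketch below. `retry` is a hypothetical helper and is not part of mon or of the original oramon script.

```shell
#!/bin/sh
# Runs a command up to N times and stops at the first success.
# $1: maximum number of attempts; the remaining arguments are the command.
# Returns 0 on the first success, or 1 if every attempt failed.
retry() {
    max=$1
    shift
    n=1
    while [ "$n" -le "$max" ]; do
        "$@" && return 0
        n=$((n + 1))
    done
    return 1
}
```

With this helper, a call such as `retry "$RETRYS" run_sqlplus_query "$TIMEQUERY"` would make exactly `RETRYS` attempts. The explicit `return 1` matters here: without it the function would return the status of the last command in the loop body, and mon treats a zero exit status as a healthy service.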

We can check that the monitor script works: first start Oracle using the method above, then run the Oracle monitor script:
[php]
[root@dc5ha1 resource.d]# cd /usr/lib/mon/mon.d
[root@dc5ha1 mon.d]# ./oracle.monitor
exit_code:0
Listener success
connection is ok
[root@dc5ha1 mon.d]# echo $?
0
[root@dc5ha1 mon.d]#
[/php]

10. Edit the /etc/mon/mon.cf file, for example:
[css collapse=”true”]
#
# Example "mon.cf" configuration for "mon".
#
# $Id: example.cf 1.1 Sat, 26 Aug 2000 15:22:34 -0400 trockij $
#

#
# This works with 0.38pre8
#

#
# global options
#
cfbasedir = /usr/lib/mon/etc
alertdir = /usr/lib/mon/alert.d
mondir = /usr/lib/mon/mon.d
logdir = /var/log/mon
#cfbasedir = /root/heartbeat/mon-0.99.2/etc
#alertdir = /root/heartbeat/mon-0.99.2/alert.d
#mondir = /root/heartbeat/mon-0.99.2/mon.d
maxprocs = 20
histlength = 100
randstart = 60s

#
# authentication types:
# getpwnam standard Unix passwd, NOT for shadow passwords
# shadow Unix shadow passwords (not implemented)
# userfile "mon" user file
#
authtype = getpwnam

#
# NB: hostgroup and watch entries are terminated with a blank line (or
# end of file). Don’t forget the blank lines between them or you lose.
#

#
# group definitions (hostnames or IP addresses)
#

#hostgroup dbserver service
hostgroup ha 192.168.6.3
hostgroup net

#
# For the servers in building 1, monitor ping and telnet
# BOFH is on weekend call 🙂
#

watch net
service net
interval 1m
monitor network.monitor
allow_empty_group
period wd {Mon-Sun}
alert test.alert
alert ha_stop.alert
alertevery 1m

watch ha
service oracle
interval 1m
monitor oracle.monitor
allow_empty_group
period wd {Mon-Sun}
alert test.alert
alert ha_stop.alert
alertevery 1m
service fip
interval 1m
monitor ping.monitor 192.168.6.3 192.168.6.254
period wd {Mon-Sun}
alert test.alert
alert ha_stop.alert
alertevery 1m
[/css]
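A misspelled alert name in mon.cf is easy to miss, because it only matters when the alert fires. As a rough sanity check, the alert scripts referenced by the file can be listed and compared with the contents of the `alertdir` directory. The helper `list_alerts` below is an assumption made for illustration and is not a mon tool.

```shell
#!/bin/sh
# Prints the distinct alert script names referenced in a mon.cf file.
# $1: config file to read (defaults to /etc/mon/mon.cf)
list_alerts() {
    conf=${1:-/etc/mon/mon.cf}
    # Only lines whose first word is exactly "alert" name a script;
    # "alertevery", "alertdir" and similar keywords are skipped.
    awk '$1 == "alert" { print $2 }' "$conf" | LC_ALL=C sort -u
}
```

Each name that `list_alerts` prints should exist as an executable file under /usr/lib/mon/alert.d.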

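11. Edit the /usr/lib/mon/alert.d/ha_stop.alert file. By default it triggers a failover by stopping the local heartbeat service. I want it to wait for a while after the failover and then start the service again, so that this node rejoins as the standby: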
[php]
[root@dc5ha1 mon.d]# cat /usr/lib/mon/alert.d/ha_stop.alert
/etc/init.d/heartbeat stop
sleep 60
/etc/init.d/heartbeat start
[/php]
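The same stop, wait, start sequence can be written with the delay as a parameter and with a check that the stop actually succeeded. This is only a sketch: the variable names `HB_INIT` and `HB_DELAY` and the function `bounce_heartbeat` are made up for illustration.

```shell
#!/bin/sh
# Stop heartbeat to hand the resources over, wait, then rejoin the cluster.
HB_INIT=${HB_INIT:-/etc/init.d/heartbeat}   # init script used in this post
HB_DELAY=${HB_DELAY:-60}                    # seconds to stay down

bounce_heartbeat() {
    # Give up if the stop fails; starting again would not help in that case.
    "$HB_INIT" stop || return 1
    sleep "$HB_DELAY"
    "$HB_INIT" start
}
```

The alert script would end with a call to `bounce_heartbeat`. Because ha.cf sets `auto_failback off`, the restarted node should stay a standby and not take the resources back.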

12. Start the heartbeat service on both servers:
[php]
[root@dc5ha1 mon.d]# /etc/init.d/heartbeat start
Starting High-Availability services:
[ ok ]
[root@dc5ha1 mon.d]# chkconfig heartbeat on
[/php]

13. Check the log file:
[css collapse=”true”]
[root@dc5ha2 resource.d]# tail -f /var/log/ha-log
heartbeat: 2011/06/23_14:31:15 info: Local status now set to: 'active'
heartbeat: 2011/06/23_14:31:15 WARN: No STONITH device configured.
heartbeat: 2011/06/23_14:31:15 WARN: Shared disks are not protected.
heartbeat: 2011/06/23_14:31:15 info: Resources being acquired from dc5ha1.
heartbeat: 2011/06/23_14:31:15 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2011/06/23_14:31:15 info: No local resources [/usr/lib64/heartbeat/ResourceManager listkeys dc5ha2] to acquire.
heartbeat: 2011/06/23_14:31:15 info: Initial resource acquisition complete (T_RESOURCES(us))
heartbeat: 2011/06/23_14:31:15 info: Taking over resource group 192.168.6.3
heartbeat: 2011/06/23_14:31:15 info: Acquiring resource group: dc5ha1 192.168.6.3 oracle hh mon
heartbeat: 2011/06/23_14:31:16 info: Running /etc/ha.d/resource.d/IPaddr 192.168.6.3 start
heartbeat: 2011/06/23_14:31:16 info: /sbin/ifconfig eth0:0 192.168.6.3 netmask 255.255.255.0 broadcast 192.168.6.255
heartbeat: 2011/06/23_14:31:16 info: Sending Gratuitous Arp for 192.168.6.3 on eth0:0 [eth0]
heartbeat: 2011/06/23_14:31:16 /usr/lib64/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-192.168.6.3 eth0 192.168.6.3 auto 192.168.6.3 ffffffffffff
heartbeat: 2011/06/23_14:31:16 info: Running /etc/ha.d/resource.d/oracle start
heartbeat: 2011/06/23_14:31:27 info: Local Resource acquisition completed. (none)
heartbeat: 2011/06/23_14:31:27 info: local resource transition completed.
heartbeat: 2011/06/23_14:31:28 info: Running /etc/ha.d/resource.d/hh start
heartbeat: 2011/06/23_14:31:32 info: Running /etc/ha.d/resource.d/mon start
heartbeat: 2011/06/23_14:32:33 info: /usr/lib64/heartbeat/mach_down: nice_failback: foreign resources acquired
heartbeat: 2011/06/23_14:32:33 info: mach_down takeover complete.
heartbeat: 2011/06/23_14:32:33 info: mach_down takeover complete for node dc5ha1.
heartbeat: 2011/06/23_14:33:21 info: Link dc5ha1:eth1 up.
heartbeat: 2011/06/23_14:33:21 info: Status update for node dc5ha1: status up
heartbeat: 2011/06/23_14:33:21 info: Link dc5ha1:eth0 up.
heartbeat: 2011/06/23_14:33:21 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2011/06/23_14:33:23 info: Status update for node dc5ha1: status active
heartbeat: 2011/06/23_14:33:23 info: remote resource transition completed.
heartbeat: 2011/06/23_14:33:23 info: Running /etc/ha.d/rc.d/status status
[/css]
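In a long log the informational lines tend to hide the few that need attention, such as the two STONITH warnings above. A small filter can pull those out. This sketch assumes that problem lines carry a `WARN:` or `ERROR:` tag after the timestamp, as the warnings in this log do; the `ERROR:` tag is an assumption, since no such line appears here. The helper name `ha_warnings` is made up.

```shell
#!/bin/sh
# Prints only the warning and error lines from a heartbeat log.
# $1: log file to read (defaults to /var/log/ha-log, the logfile set in ha.cf)
ha_warnings() {
    log=${1:-/var/log/ha-log}
    grep -E ' (WARN|ERROR): ' "$log"
}
```

`ha_warnings` returns a non-zero status when nothing matches, so `ha_warnings || echo "log is clean"` works as a quick check after a failover test.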