
HeartBeat 1.0 Configuration Example

I recently set up a two-node HB 1.0 cluster. These notes record the rough procedure so that I don't forget it.

Environment:

Two Red Flag DC5 SP4 64-bit systems
Floating IP  192.168.6.3
ha1 IP 192.168.6.1 (eth0)  10.10.1.11 (eth1)
ha2 IP 192.168.6.2 (eth0)  10.10.1.12 (eth1)
Gateway 192.168.6.254
One simulated 20 GB storage volume
Oracle 10.2.0.1
HeartBeat 1.0

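Procedure:
1. Install the operating system; not covered here.

2. On a separate system, use the iSCSI service to present a simulated 20 GB volume to both servers, and verify that both detect the storage correctly. How to configure iSCSI is covered in other articles on this site, so it is not repeated here.

3. Install Oracle 10.2.0.1. The installation is fairly simple, and anything unclear can be resolved with a Google or Baidu search, so it is not covered here either. The values used in the rest of this article are:
$ORACLE_SID orcl
$ORACLE_HOME /opt/app/oracle/product/10.2.0/db_1
Password of the Oracle system user: asdasd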

4. Install mon-0.99.2

5. Edit the hosts file to map the host names to their IP addresses

[root@dc5ha2 ~]# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost localhost.localdomain localhost
192.168.6.1             dc5ha1
10.10.1.11              dc5ha1
192.168.6.2             dc5ha2
10.10.1.12              dc5ha2
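
Heartbeat's `node` entries in ha.cf have to match the host names the machines report (`uname -n`), so it can help to confirm that both names are present in the hosts file before going further. A minimal sketch, assuming a helper named `check_hosts` (a hypothetical name, not part of Heartbeat) that scans a hosts-format file for each name:

```shell
#!/bin/sh
# check_hosts FILE NAME...
# Hypothetical helper: prints "missing: NAME" for every NAME that has no
# entry in FILE (hosts format), and returns 1 if any name is missing.
check_hosts() {
    hosts_file=$1
    shift
    missing=0
    for name in "$@"; do
        # skip comment lines, then look for NAME in any host/alias column
        if ! awk -v n="$name" '
                $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }' "$hosts_file"; then
            echo "missing: $name"
            missing=1
        fi
    done
    return $missing
}
```

On either node this could be called as `check_hosts /etc/hosts dc5ha1 dc5ha2`.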

6. In /etc/ha.d, configure the authkeys and ha.cf files:

[root@dc5ha2 ha.d]# cat authkeys
auth 1
1 sha1 rf
[root@dc5ha2 ha.d]# cat ha.cf
bcast   eth1 eth0
keepalive       2
deadtime        30
udpport 694
auto_failback off
node    dc5ha1
node    dc5ha2
logfile /var/log/ha-log
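
One detail the listing does not show: Heartbeat checks the permissions on authkeys at startup and expects the file to be readable and writable by root only (mode 600). A small sketch of a check, where `check_authkeys_mode` is a hypothetical helper name:

```shell
#!/bin/sh
# check_authkeys_mode FILE
# Hypothetical helper: prints "ok" if FILE has mode 600 (-rw-------),
# otherwise prints the permissions found and returns 1.
check_authkeys_mode() {
    # the first 10 characters of "ls -ld" are the type and permission bits
    perms=$(ls -ld "$1" | cut -c1-10)
    if [ "$perms" = "-rw-------" ]; then
        echo "ok"
    else
        echo "bad permissions: $perms"
        return 1
    fi
}
```

If `check_authkeys_mode /etc/ha.d/authkeys` reports bad permissions, `chmod 600 /etc/ha.d/authkeys` fixes it.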

7. Edit /etc/ha.d/haresources:

[root@dc5ha2 ha.d]# cat haresources
dc5ha1 192.168.6.3 oracle  mon
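
A haresources line names the preferred node first, followed by the resources. Heartbeat starts them left to right and stops them right to left, and a bare IP address is shorthand for the IPaddr resource script. The sketch below only illustrates that ordering; `describe_haresources_line` is a hypothetical helper, not a Heartbeat tool:

```shell
#!/bin/sh
# describe_haresources_line "NODE RES1 RES2 ..."
# Hypothetical helper: prints the preferred node, the start order
# (left to right) and the stop order (right to left).
describe_haresources_line() {
    # split the line on whitespace into positional parameters
    set -- $1
    node=$1
    shift
    echo "preferred node: $node"
    echo "start order: $*"
    rev=""
    for r in "$@"; do
        # prepend each resource so the list ends up reversed
        rev="$r${rev:+ }$rev"
    done
    echo "stop order: $rev"
}

describe_haresources_line "dc5ha1 192.168.6.3 oracle  mon"
```

For the line used here, the stop order comes out as mon, then oracle, then the floating IP, so monitoring is stopped before the database it watches.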

8. In /etc/ha.d/resource.d, edit the oracle file. This is the start/stop script for the oracle resource, for example:

#!/bin/sh
# $Id: S30oracle,v 1.1 2000/09/27 14:40:55 ar Exp $
#
# Script:       $RCSfile: S30oracle,v $
#
# Description:  Example Oracle control script
#               (Based on dbora example in Oracle docs)
#               Requires: dbstart, dbshut from Oracle, oratab file
#               Modify as required
#
# Platform:     Unix
#
# Author:       Ade Rixon - RSi Solutions Ltd
#

#
# Oracle configuration (EDIT)
#

#
# standard variables and functions
#

script="`basename $0`"

#
# args: <start|stop> [no_of_attempts]
#
state=$1                # starting or stopping?

#
ORACLE_OWNER=oracle
export ORACLE_OWNER
ORACLE_HOME=/opt/app/oracle/product/10.2.0/db_1
ORACLE_SID=orcl

#
# decide action based on first argument
#
case "${state}" in

'start')
        rm -f /tmp/oramon-oracle-orcl.pid


        mount /dev/sdb1 /oradata
        if [ $? -ne 0 ]
        then
                sleep 1
                mount /dev/sdb1 /oradata

        fi

        su - "$ORACLE_OWNER" -c "/home/oramon/oramon/bin/Ora_start $ORACLE_HOME  $ORACLE_SID"
        ;;

'stop')
        su - "$ORACLE_OWNER" -c "/home/oramon/oramon/bin/Ora_stop $ORACLE_HOME $ORACLE_SID"
        umount /dev/sdb1

        if [ $? -ne 0 ]
        then
                sleep 1
                if  df | grep "sdb1"
                then
                        fuser -s -m -k /dev/sdb1 >/dev/null 2>&1
                else
                        exit 0
                fi

                sleep 3

                if  df | grep "sdb1"
                then
                        fuser -s -m -k /dev/sdb1 >/dev/null 2>&1
                else
                        exit 0
                fi
                sleep 1
                umount  /dev/sdb1

        fi
        exit 0
        ;;

*)
        echo "Usage: $0 <start|stop>"
        exit 1  # warning code
        ;;

esac

exit 0  # OK code (default)
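
The script above retries `mount` and `umount` by repeating the command with a `sleep` in between. The same pattern can be factored into one function. This is only a sketch of that idea; `retry` is a hypothetical helper, not something the original script uses:

```shell
#!/bin/sh
# retry ATTEMPTS DELAY COMMAND...
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS runs
# have failed, sleeping DELAY seconds between attempts.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        "$@" && return 0
        n=$((n + 1))
        # only sleep if another attempt is coming
        [ "$n" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}
```

In the resource script this could look like `retry 2 1 mount /dev/sdb1 /oradata || exit 1`, which would also keep Oracle from being started when the data volume is not mounted.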

Both start and stop call commands under /home/oramon/oramon/bin, so that directory tree has to be owned by the oracle user:

[root@dc5ha1 /]# cd  /home
[root@dc5ha1 home]# chown -R oracle:oinstall oramon/

For reference, here are the start and stop scripts that it calls:

[root@dc5ha1 bin]# cat Ora_start
#!/bin/sh
ORACLE_HOME=$1
ORACLE_SID=$2
#ORACLE_HOME=/home/oracle/oracle/product/10.2.0/db_1
export ORACLE_HOME
#ORACLE_SID=oracle
export ORACLE_SID

lsnrctl start
#$ORACLE_HOME/bin/sqlplus <<!
#dbstart
$ORACLE_HOME/bin/sqlplus '/ as sysdba' <<!
startup
exit
!
[root@dc5ha1 bin]# cat Ora_stop
#!/bin/sh
ORACLE_HOME=$1
ORACLE_SID=$2
#ORACLE_HOME=/home/oracle/oracle/product/10.2.0/db_1
export ORACLE_HOME
#ORACLE_SID=oracle
export ORACLE_SID

lsnrctl stop
#$ORACLE_HOME/bin/sqlplus <<!
$ORACLE_HOME/bin/sqlplus '/ as sysdba' <<!
shutdown immediate
exit
!
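
These wrappers return whatever `sqlplus` returns, and a plain `exit` inside the heredoc typically yields 0 even when `startup` printed an error. One way to double-check the result is to look for the instance's PMON background process, which on Unix is named `ora_pmon_<SID>`. A minimal sketch, with `instance_running` as a hypothetical helper name:

```shell
#!/bin/sh
# instance_running SID
# Hypothetical helper: returns 0 if a process whose command name is
# exactly ora_pmon_SID exists, 1 otherwise.
instance_running() {
    # compare only the first word of each command line, so the awk
    # process itself (whose first word is "awk") never matches
    ps -e -o args= | awk -v p="ora_pmon_$1" '
        $1 == p { found = 1 }
        END { exit !found }'
}
```

After `./oracle start`, a call such as `instance_running orcl && echo up` gives a second opinion that does not depend on the `sqlplus` exit status.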

Start and stop the resource by hand to check that the scripts work:

[root@dc5ha1 /]# cd /etc/ha.d/resource.d/
[root@dc5ha1 resource.d]# ./oracle start

LSNRCTL for Linux: Version 10.2.0.1.0 - Production on 23-6月 -2011 14:28:22

Copyright (c) 1991, 2005, Oracle.  All rights reserved.

Starting /opt/app/oracle/product/10.2.0/db_1/bin/tnslsnr: please wait...

TNSLSNR for Linux: Version 10.2.0.1.0 - Production
System parameter file is /opt/app/oracle/product/10.2.0/db_1/network/admin/listener.ora
Log messages written to /opt/app/oracle/product/10.2.0/db_1/network/log/listener.log
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=dc5ha1)(PORT=1521)))

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 10.2.0.1.0 - Production
Start Date                23-6月 -2011 14:28:24
Uptime                    0 days 0 hr. 0 min. 0 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /opt/app/oracle/product/10.2.0/db_1/network/admin/listener.ora
Listener Log File         /opt/app/oracle/product/10.2.0/db_1/network/log/listener.log
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=dc5ha1)(PORT=1521)))
Services Summary...
Service "PLSExtProc" has 1 instance(s).
  Instance "PLSExtProc", status UNKNOWN, has 1 handler(s) for this service...
The command completed successfully

SQL*Plus: Release 10.2.0.1.0 - Production on 星期四 6月 23 14:28:24 2011

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> ORACLE instance started.

Total System Global Area  276824064 bytes
Fixed Size                  2020160 bytes
Variable Size             142609600 bytes
Database Buffers          130023424 bytes
Redo Buffers                2170880 bytes
Database mounted.
Database opened.
SQL> Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options
[root@dc5ha1 resource.d]# echo $?
0
[root@dc5ha1 resource.d]#
[root@dc5ha1 resource.d]# ./oracle stop

LSNRCTL for Linux: Version 10.2.0.1.0 - Production on 23-6月 -2011 14:30:19

Copyright (c) 1991, 2005, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1)))
The command completed successfully

SQL*Plus: Release 10.2.0.1.0 - Production on 星期四 6月 23 14:30:29 2011

Copyright (c) 1982, 2005, Oracle.  All rights reserved.


Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options

SQL> Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options
[root@dc5ha1 resource.d]# echo $?
0
[root@dc5ha1 resource.d]#   

9. In /usr/lib/mon/mon.d, edit the Oracle monitor script (oracle.monitor), for example:

#!/bin/sh
# $Id: oramon,v 1.4 2000/04/05 16:45:05 ar Exp $
#
# Script:       oramon.sh
#
# Description:  Oracle test script
#
# Platform:     Unix
#
# Authors:      Vincent Sanders - RSi
#

ORACLE_BASH=/home/oracle
. /home/oracle/.bash_profile
#. /home/oracle/.bash_profile

# defaults
# retry count
RETRYS=3
# location of sql query files
QUERY_HOME=/home/oramon/oramon/lib
# enable or disable the checks (1/0)
# you must enable at least one
LISTENERCHECK=1
SYSTIMECHECK=1
DBFREECHECK=1
WRITEREADCHECK=0

script="`basename $0`"

#if [ $# -ne 3 ]; then
#       echo "Usage: ${script} <username> <password>" >&1
#       exit
#fi

#ORACLE_SID=$1
#ORACLE_USER=$2
#ORACLE_PASSWD=$3
ORACLE_SID="orcl"
ORACLE_USER="system"
ORACLE_PASSWD="asdasd"
# check for required environment vars
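# quote the expansions so an unset variable cannot break the test, and
# exit non-zero so mon treats a broken environment as a failure
if [ -z "${ORACLE_HOME}" ] || [ -z "${ORACLE_SID}" ]; then
        echo "Oracle environment not set, aborting"
        exit 1
fi
if [ ! -d "${ORACLE_HOME}" ]; then
        echo "${ORACLE_HOME} is not a directory, aborting"
        exit 1
fi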

# -------------- no more vars to edit below here ------------

# -------------------- Query Functions ---------------
run_sqlplus_query_retry()
{
    # $1 script name
    # $2 retry count
    loop_count=1
    while [ ${loop_count} -lt $2 ]; do

        run_sqlplus_query "$1"

        if [ $? -eq 1 ]; then
            # db error - retry
            loop_count=`expr $loop_count + 1`
        else
            return 0
        fi
    done
    # every attempt failed; without this the function would return 0
    return 1
}

run_sqlplus_query()
{
    #$1 is sql plus script name
    result=`su - oracle -c "/home/oramon/oramon/bin/Ora_sqlplus $ORACLE_SID @${QUERY_HOME}/$1 ${ORACLE_USER} ${ORACLE_PASSWD}"`
    exit_code=$?

    #0 is success
    #1 is failure
    #10 db not available (wrong db name?)
    #249 is bad login
    #136 invalid column name

    # -- Debug bits --
    #echo $exit_code
    #echo "$result"

    if [ ${exit_code} -eq 0 ]; then
        return 0
    else
        return 1
    fi

}

run_pl_sql_query_retry()
{
    # $1 script name
    # $2 retry count
    loop_count=0
    while [ ${loop_count} -le $2 ]; do

        run_pl_sql_query "$1"

        if [ $? -eq 1 ]; then
            # db error - retry
            loop_count=`expr $loop_count + 1`
        else
            return 0
        fi
    done
    # every attempt failed; without this the function would return 0
    return 1
}


run_pl_sql_query()
{
    #$1 is pl/sql script name
    # a sqlplus script is used to execute all the pl/sql stuff so we can
    # get connect errors etc back
    result=`su - oracle -c "/home/oramon/oramon/bin/Ora_sqlplus_rw $ORACLE_SID @${QUERY_HOME}/runplsql.sql ${ORACLE_USER} ${ORACLE_PASSWD} $1"`

    exit_code=$?
    # 0 is success - HOWEVER
        # pl/sql may return errors but it only prints them to stdout. We have
        # to check the return string for the ERROR keyword (ORA-20000 is user
        # errors generated by script "raise_application_error" commands)
    # 1 is failure
    # 10 db not available (wrong db name?)
    # 249 is bad login
    # 136 invalid column name

    # -- DEBUG --
    #echo $exit_code
    #echo "$result"

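    if [ ${exit_code} -eq 0 ]; then
        # expr prints the match length, which is not a valid 0/1 status;
        # map "ERROR found" to 1 and "no ERROR" to 0 explicitly
        if expr "${result}" : ".*ERROR.*" >/dev/null; then
            return 1
        else
            return 0
        fi
    else
        return 1
    fi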

}

run_lsn_check()
{
    result=`su - oracle -c "/home/oramon/oramon/bin/lsn_check $ORACLE_SID"`

    exit_code=$?
    echo "exit_code:$exit_code"
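    # return 0 only if lsn_check exited cleanly and its output starts with
    # "success"; return 1 otherwise so the caller's "-eq 1" test can fire
    if [ ${exit_code} -eq 0 ] && expr "${result}" : "success" >/dev/null; then
        return 0
    else
        return 1
    fi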

}

# ----------------------- Main Script ---------------------

TIMEQUERY=systime.sql
FREEQUERY=dbfree.sql
WRQUERY=writeread.sql

LD_LIBRARY_PATH=${ORACLE_HOME}/lib:/lib:${LD_LIBRARY_PATH}
PATH=${ORACLE_HOME}/bin:${ORACLE_HOME}/JRE/bin:${PATH}

export LD_LIBRARY_PATH

result=""
exit_code=0
val1=""
val2=""

# ----------- the Listener check    -------
if [ ${LISTENERCHECK} = 1 ]; then
    run_lsn_check ${ORACLE_SID}
    if [ $? -eq 1 ]; then
        # db connect/retrieve fault - Error
        echo "Listener failed"
        exit 1
    else
        echo "Listener success"

    fi
fi
# ----------- first check if database systime is changing ----
if [ ${SYSTIMECHECK} = 1 ]; then
    run_sqlplus_query_retry ${TIMEQUERY} ${RETRYS}
    if [ $? -eq 1 ]; then
        # db connect/retrieve fault - Error
        exit 1
    else
        val1=${result}
    fi

    # delay by at least 1 second to give the db time a chance to change
    sleep 1

    run_sqlplus_query_retry ${TIMEQUERY} ${RETRYS}
    if [ $? -eq 1 ]; then
        # db connect/retrieve fault - Error
        exit 1
    else
        val2=${result}
    fi

    if [ "${val1}" != "${val2}" ]; then
        # we can connect and query the systime and its updating - dbs ok
        echo "connection is ok"
        exit 0
    else
        echo "connection failed"
    fi
fi

# ----------------- then check database free space -------------
# see if dbs busy - hence didnt update systime
if [ ${DBFREECHECK} = 1 ]; then
    run_sqlplus_query_retry ${FREEQUERY} ${RETRYS}
    if [ $? -eq 1 ]; then
        # db connect/retrieve fault - Error
        exit 1
    else
        val1=${result}
    fi

    # delay by at least 1 second to give the db a chance to change size
    sleep 1

    run_sqlplus_query_retry ${FREEQUERY} ${RETRYS}
    if [ $? -eq 1 ]; then
        # db connect/retrieve fault - Error
        exit 1
    else
        val2=${result}
    fi

    if [ "${val1}" != "${val2}" ]; then
        # we can connect and query the db free space and its changing - dbs ok
        exit 0
    fi
fi

# ----------- the ultimate test - write , read , delete cycle -------
if [ ${WRITEREADCHECK} = 1 ]; then
    run_pl_sql_query ${WRQUERY}
    if [ $? -eq 1 ]; then
        # db connect/retrieve fault - Error
        exit 1
    else
        exit 0
    fi
fi

# all enabled tests failed - database is dead
exit 1
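
The monitor leans on `expr STRING : REGEX` in several places. That form prints the number of characters matched (anchored at the start of the string) rather than 0 or 1, so using its output directly as a return code is easy to get wrong. The sketch below shows the behaviour and one way to turn it into a plain status; `output_starts_with` is a hypothetical helper name:

```shell
#!/bin/sh
# expr prints the length of the match, not a boolean
matched=$(expr "success: listener up" : "success")
echo "match length: $matched"

# output_starts_with STRING REGEX
# Hypothetical helper: returns 0 if STRING matches REGEX from its first
# character, 1 otherwise. The X prefix keeps expr from misreading
# strings that look like operators.
output_starts_with() {
    expr "X$1" : "X$2" >/dev/null
}

if output_starts_with "listener down" "success"; then
    echo "listener up"
else
    echo "listener down"
fi
```

Since mon treats exit code 0 as success and any other code as failure, each check function is easier to reason about when it returns only 0 or 1.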

To check that the monitor script works, first start Oracle using the method above, then run the monitor script:

[root@dc5ha1 resource.d]# cd /usr/lib/mon/mon.d
[root@dc5ha1 mon.d]# ./oracle.monitor
exit_code:0
Listener success
connection is ok
[root@dc5ha1 mon.d]# echo $?
0
[root@dc5ha1 mon.d]# 

10. Edit /etc/mon/mon.cf, for example:

#
# Example "mon.cf" configuration for "mon".
#
# $Id: example.cf 1.1 Sat, 26 Aug 2000 15:22:34 -0400 trockij $
#

#
# This works with 0.38pre8
#

#
# global options
#
cfbasedir   = /usr/lib/mon/etc
alertdir    = /usr/lib/mon/alert.d
mondir      = /usr/lib/mon/mon.d
logdir     = /var/log/mon
#cfbasedir   = /root/heartbeat/mon-0.99.2/etc
#alertdir    = /root/heartbeat/mon-0.99.2/alert.d
#mondir      = /root/heartbeat/mon-0.99.2/mon.d
maxprocs    = 20
histlength = 100
randstart = 60s

#
# authentication types:
#   getpwnam      standard Unix passwd, NOT for shadow passwords
#   shadow        Unix shadow passwords (not implemented)
#   userfile      "mon" user file
#
authtype = getpwnam

#
# NB:  hostgroup and watch entries are terminated with a blank line (or
# end of file).  Don't forget the blank lines between them or you lose.
#

#
# group definitions (hostnames or IP addresses)
#

#hostgroup dbserver service
hostgroup ha 192.168.6.3
hostgroup net

#
# For the servers in building 1, monitor ping and telnet
# BOFH is on weekend call 🙂
#

watch net
    service net
        interval 1m
        monitor network.monitor
        allow_empty_group
        period wd {Mon-Sun}
            alert test.alert
            alert ha_stop.alert
            alertevery 1m

watch ha
    service oracle
        interval 1m
        monitor oracle.monitor
        allow_empty_group
        period wd {Mon-Sun}
            alert test.alert
            alert ha_stop.alert
            alertevery 1m
    service fip
        interval 1m
        monitor ping.monitor 192.168.6.3 192.168.6.254
        period wd {Mon-Sun}
            alert test.alert
            alert ha_stop.alert
            alertevery 1m
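
Each `monitor` and `alert` line in mon.cf names a program that has to exist in `mondir` or `alertdir`, and a misspelled name is easy to miss. A rough sketch of a consistency check, assuming a hypothetical helper `check_mon_cf` that takes the config file and the two directories as arguments:

```shell
#!/bin/sh
# check_mon_cf MON_CF MONDIR ALERTDIR
# Hypothetical helper: reports every monitor or alert named in MON_CF
# that is not an executable file in MONDIR / ALERTDIR.
# Returns 1 if anything is missing.
check_mon_cf() {
    cf=$1
    mondir=$2
    alertdir=$3
    status=0
    # second field of each "monitor ..." line is the program name
    for name in $(awk '$1 == "monitor" { print $2 }' "$cf" | sort -u); do
        [ -x "$mondir/$name" ] || { echo "missing monitor: $name"; status=1; }
    done
    # second field of each "alert ..." line is the program name
    for name in $(awk '$1 == "alert" { print $2 }' "$cf" | sort -u); do
        [ -x "$alertdir/$name" ] || { echo "missing alert: $name"; status=1; }
    done
    return $status
}
```

With the paths from this configuration the call would be `check_mon_cf /etc/mon/mon.cf /usr/lib/mon/mon.d /usr/lib/mon/alert.d`.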

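11. Edit /usr/lib/mon/alert.d/ha_stop.alert. It triggers a failover by stopping the local heartbeat service. I wanted the node to wait a while after the switch and then start the service again so that it rejoins as the standby: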

[root@dc5ha1 mon.d]# cat /usr/lib/mon/alert.d/ha_stop.alert
/etc/init.d/heartbeat stop
sleep  60
/etc/init.d/heartbeat start

12. Start the heartbeat service on both servers:

[root@dc5ha1 mon.d]# /etc/init.d/heartbeat start
Starting High-Availability services:
                                                           [  ok  ]
[root@dc5ha1 mon.d]# chkconfig heartbeat on

13. Check the log file:

[root@dc5ha2 resource.d]# tail -f /var/log/ha-log
heartbeat: 2011/06/23_14:31:15 info: Local status now set to: 'active'
heartbeat: 2011/06/23_14:31:15 WARN: No STONITH device configured.
heartbeat: 2011/06/23_14:31:15 WARN: Shared disks are not protected.
heartbeat: 2011/06/23_14:31:15 info: Resources being acquired from dc5ha1.
heartbeat: 2011/06/23_14:31:15 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2011/06/23_14:31:15 info: No local resources [/usr/lib64/heartbeat/ResourceManager listkeys dc5ha2] to acquire.
heartbeat: 2011/06/23_14:31:15 info: Initial resource acquisition complete (T_RESOURCES(us))
heartbeat: 2011/06/23_14:31:15 info: Taking over resource group 192.168.6.3
heartbeat: 2011/06/23_14:31:15 info: Acquiring resource group: dc5ha1 192.168.6.3 oracle hh mon
heartbeat: 2011/06/23_14:31:16 info: Running /etc/ha.d/resource.d/IPaddr 192.168.6.3 start
heartbeat: 2011/06/23_14:31:16 info: /sbin/ifconfig eth0:0 192.168.6.3 netmask 255.255.255.0    broadcast 192.168.6.255
heartbeat: 2011/06/23_14:31:16 info: Sending Gratuitous Arp for 192.168.6.3 on eth0:0 [eth0]
heartbeat: 2011/06/23_14:31:16 /usr/lib64/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-192.168.6.3 eth0 192.168.6.3 auto 192.168.6.3 ffffffffffff
heartbeat: 2011/06/23_14:31:16 info: Running /etc/ha.d/resource.d/oracle  start
heartbeat: 2011/06/23_14:31:27 info: Local Resource acquisition completed. (none)
heartbeat: 2011/06/23_14:31:27 info: local resource transition completed.
heartbeat: 2011/06/23_14:31:28 info: Running /etc/ha.d/resource.d/hh  start
heartbeat: 2011/06/23_14:31:32 info: Running /etc/ha.d/resource.d/mon  start
heartbeat: 2011/06/23_14:32:33 info: /usr/lib64/heartbeat/mach_down: nice_failback: foreign resources acquired
heartbeat: 2011/06/23_14:32:33 info: mach_down takeover complete.
heartbeat: 2011/06/23_14:32:33 info: mach_down takeover complete for node dc5ha1.
heartbeat: 2011/06/23_14:33:21 info: Link dc5ha1:eth1 up.
heartbeat: 2011/06/23_14:33:21 info: Status update for node dc5ha1: status up
heartbeat: 2011/06/23_14:33:21 info: Link dc5ha1:eth0 up.
heartbeat: 2011/06/23_14:33:21 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2011/06/23_14:33:23 info: Status update for node dc5ha1: status active
heartbeat: 2011/06/23_14:33:23 info: remote resource transition completed.
heartbeat: 2011/06/23_14:33:23 info: Running /etc/ha.d/rc.d/status status
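
The lines of most interest in this log are the ones where heartbeat runs a resource script, since they show the order in which resources were acquired. A small sketch for pulling them out of the log, with `resource_events` as a hypothetical helper name:

```shell
#!/bin/sh
# resource_events LOGFILE
# Hypothetical helper: for every "Running /etc/ha.d/resource.d/..." line,
# prints the timestamp, the resource script name and the action.
resource_events() {
    awk '$4 == "Running" && index($5, "/resource.d/") {
             n = split($5, parts, "/")   # last path component is the script
             print $2, parts[n], $NF     # $NF is the action (start/stop)
         }' "$1"
}
```

Running `resource_events /var/log/ha-log` on the standby after a failover lists, in order, which resource scripts were started and when.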


If you repost articles from this site, please credit: Edward's Blog

Permalink: http://www.edward-han.com/177.html

Categories: HA Cluster, Linux
  1. July 11, 2011, 14:35 | #1

    Very detailed write-up, nice!

  2. Edward_Han
    July 12, 2011, 15:07 | #2

    @oracle学习
    Haha, thanks!
    It's not really that detailed, just summary notes.
