Beginners Lesson – Veritas Cluster Services for System Admin

The purpose of this post is to make the cluster concept easy for newcomers who have just started their careers as system administrators. While writing it, I had only one thing in mind: explain the entire cluster concept with as little technical jargon as possible and keep it as simple as I can. That is all for the introduction; let us go to the actual lesson.

In any organisation, every server in the network has a specific purpose, and most of the time these servers provide a stable environment for the software applications that the organisation’s business requires. Usually these applications are critical to the business, and the organisation cannot afford to have them down even for minutes. For example, a bank has an application that handles its internet banking.

The figure below shows an application running on a standalone server configured with a Unix operating system and a database (Oracle / Sybase / DB2 / MSSQL, etc.). The organisation chose to run it as a standalone application because it is not critical to the business; in other words, whenever the application is down, the actual business is not impacted.

Usually, the clients of such an application connect to the application server using the server name, the server IP or a specific application IP.

Standalone Application Server

Now assume the organisation has an application that is very critical for its business, and any impact to the application will cause a huge loss to the organisation. In that case, the organisation has one option to reduce the impact of an application failure caused by the operating system or hardware: purchase a secondary server with the same hardware configuration, install the same OS and database, and configure it with the same application in passive mode. Then “fail over” the application from the primary server to this secondary server whenever there is an issue with the underlying hardware or operating system of the primary server.

Application Server with Highly Available Configuration

What is failover?

Whenever an issue with the primary server makes the application unavailable to the client machines, the application should be moved to another available server in the network, either manually or automatically. Transferring the application from the primary server to the secondary server, and making the secondary server active for the application, is called a “failover” operation. The reverse operation (i.e. restoring the application on the primary server) is called “failback”.

Now we can call this configuration an application HA (Highly Available) setup, compared to the earlier standalone setup. Do you agree with me?

Now the question is, how does this manual failover work when there is an application issue due to the hardware or operating system?

Manual failover basically involves the steps below:

  1.  The application IP should fail over to the secondary node.
  2.  The same storage and data should be available on the secondary node.
  3.  Finally, the application should fail over to the secondary node.

Application Server failover to Secondary Server

Challenges in Manual Failover Configuration

  1.  Resources must be monitored continuously.
  2.  It is time consuming.
  3.  It is technically complex when the application has many dependent components.

Then, what is the alternative?

Just go for automated failover software, which groups the primary and secondary servers related to the application, always keeps an eye on the primary server for failures, and fails the application over to the secondary server automatically whenever there is an issue with the primary server.
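To make the idea concrete, here is a minimal Python sketch of that "keep an eye on the primary and switch when it fails" decision. It is only an illustration of the concept, not how any real cluster product is implemented; the `Server` class and `pick_active` function are names I made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True


def pick_active(active: Server, standby: Server) -> Server:
    """Return the server that should run the application after one health check."""
    if active.healthy:
        return active            # primary is fine, nothing to do
    if standby.healthy:
        return standby           # automatic failover to the standby
    raise RuntimeError("no healthy server left to run the application")


primary = Server("primary")
secondary = Server("secondary")

before = pick_active(primary, secondary)    # primary while it is healthy
primary.healthy = False                     # simulate a hardware or OS failure
after = pick_active(primary, secondary)     # secondary takes over
```

A real product repeats this check continuously and also moves the IP address and storage along with the application, which is exactly the work the manual procedure leaves to the administrator.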

Although we have two different servers supporting the application, both of them actually serve the same purpose. From the application client’s perspective, they should be treated as a single application cluster server (composed of multiple physical servers in the background).

Wow…. Cluster .  

Now you know that a cluster is nothing but a “group of individual servers working together to serve the same purpose, and appearing as a single machine to the external world”.

What cluster software is available in the market today? There are many products, depending on the operating system and the application to be supported. Some of them are native to the operating system, and others come from third-party vendors.

List of Cluster Software available in the market

  • SUN Cluster Services – Native Solaris Cluster
  • Linux Cluster Server – Native Linux cluster
  • Oracle RAC – Application level cluster for Oracle database that works on different Operating Systems
  • Veritas Cluster Services – Third-party cluster software that works on different operating systems such as Solaris / Linux / AIX / HP-UX.
  • HACMP – IBM AIX based Cluster Technology
  • HP UX native Cluster Technology

In this post, we are discussing VCS and its operations. This post does not cover the actual implementation or any VCS command syntax; it covers the concept of how VCS makes an application highly available (HA).

Note: So far I have managed to explain the concept without much complex terminology, but now it is time to introduce some VCS terminology that we use in everyday VCS operations. Just pay a little more attention to each new term.

VCS Components

VCS has two types of components: 1. Physical components 2. Logical components

Physical Components:

1. Nodes

VCS nodes host the service groups (managed applications). Each system is connected to networking hardware, and usually also to storage hardware. The systems contain components to provide resilient management of the applications, and start and stop agents.

Nodes can be individual systems, or they can be created with domains or partitions on enterprise-class systems. Individual cluster nodes each run their own operating system and possess their own boot device. Each node must run the same operating system within a single VCS cluster.

Clusters can have from 1 to 32 nodes. Applications can be configured to run on specific nodes within the cluster.

2. Shared storage

Storage is a key resource of most applications services, and therefore most service groups. A managed application can only be started on a system that has access to its associated data files. Therefore, a service group can only run on all systems in the cluster if the storage is shared across all systems. In many configurations, a storage area network (SAN) provides this requirement.

You can use I/O fencing technology for data protection. I/O fencing blocks access to shared storage from any system that is not a current and verified member of the cluster.

3. Networking Components

Networking in the cluster is used for the following purposes:

  • Communications between the cluster nodes and the Application Clients and external systems.
  • Communications between the cluster nodes, called Heartbeat network.

Logical Components

1. Resources

Resources are hardware or software entities that make up the application. Resources include disk groups and file systems, network interface cards (NIC), IP addresses, and applications.

1.1. Resource dependencies

Resource dependencies indicate resources that depend on each other because of application or operating system requirements. Resource dependencies are graphically depicted in a hierarchy, also called a tree, where the resources higher up (parent) depend on the resources lower down (child).
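As a rough illustration of why the dependency tree matters, the Python sketch below derives a start order (children first) and a stop order (parents first) from a small dependency map. The resource names are hypothetical and the traversal is a plain depth-first walk of my own; it does not check for cycles and is not VCS code.

```python
# parent -> list of children it depends on (hypothetical resource names)
DEPENDS_ON = {
    "app": ["ip", "mount"],
    "ip": ["nic"],
    "mount": ["diskgroup"],
    "nic": [],
    "diskgroup": [],
}


def online_order(depends_on):
    """Return resources so that every child comes before its parent."""
    order, seen = [], set()

    def visit(resource):
        if resource in seen:
            return
        seen.add(resource)
        for child in depends_on[resource]:
            visit(child)            # bring the children up first
        order.append(resource)      # then the parent itself

    for resource in depends_on:
        visit(resource)
    return order


def offline_order(depends_on):
    """Stopping is the reverse: parents go down before their children."""
    return list(reversed(online_order(depends_on)))
```

With the map above, the application is the last thing to start and the first thing to stop, which matches the intuition that it cannot run without its IP address and file system.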

1.2. Resource types

VCS defines a resource type for each resource it manages. For example, the NIC resource type can be configured to manage network interface cards. Similarly, all IP addresses can be configured using the IP resource type.

VCS includes a set of predefined resources types. For each resource type, VCS has a corresponding agent, which provides the logic to control resources.

2. Service groups

A service group is a virtual container that contains all the hardware and software resources that are required to run the managed application. Service groups allow VCS to control all the hardware and software resources of the managed application as a single unit. When a failover occurs, resources do not fail over individually— the entire service group fails over. If there is more than one service group on a system, a group may fail over without affecting the others.

A single node may host any number of service groups, each providing a discrete service to networked clients. If the server crashes, all service groups on that node must be failed over elsewhere.

Service groups can be dependent on each other. For example a finance application may be dependent on a database application. Because the managed application consists of all components that are required to provide the service, service group dependencies create more complex managed applications. When you use service group dependencies, the managed application is the entire dependency tree.

2.1. Types of service groups

VCS service groups fall in three main categories: failover, parallel, and hybrid.

  • Failover service groups

A failover service group runs on one system in the cluster at a time. Failover groups are used for most applications that do not support multiple systems to simultaneously access the application’s data.

  • Parallel service groups

A parallel service group runs simultaneously on more than one system in the cluster. A parallel service group is more complex than a failover group. Parallel service groups are appropriate for applications that manage multiple application instances running simultaneously without data corruption.

  • Hybrid service groups

A hybrid service group is for replicated data clusters and is a combination of the failover and parallel service groups. It behaves as a failover group within a system zone and a parallel group across system zones.
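The difference between the three group types can be summarised as one rule: where is the group allowed to come online, given where it is already running? The Python sketch below encodes that rule as I understand it from the descriptions above. The function name, the node names and the `zone_of` mapping are assumptions made for the example.

```python
def may_come_online(group_type, online_on, target, zone_of=None):
    """Decide whether a group may start on `target`, given where it already runs."""
    if target in online_on:
        return False                      # already running there
    if group_type == "failover":
        return len(online_on) == 0        # only one system at a time
    if group_type == "parallel":
        return True                       # any number of systems
    if group_type == "hybrid":
        # failover inside a system zone, parallel across zones
        return all(zone_of[node] != zone_of[target] for node in online_on)
    raise ValueError(f"unknown group type: {group_type}")


zones = {"a1": "zoneA", "a2": "zoneA", "b1": "zoneB"}
same_zone = may_come_online("hybrid", {"a1"}, "a2", zones)    # False
other_zone = may_come_online("hybrid", {"a1"}, "b1", zones)   # True
```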

3. VCS Agents

Agents are multi-threaded processes that provide the logic to manage resources. VCS has one agent per resource type. The agent monitors all resources of that type; for example, a single IP agent manages all IP resources.

When the agent is started, it obtains the necessary configuration information from VCS. It then periodically monitors the resources, and updates VCS with the resource status.
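The "one agent per resource type" idea can be pictured with a small Python class. This is only a conceptual sketch: the `Agent` class, the `probe` callback and the resource names are invented here, and real agents are separate processes with more entry points than a single monitor call.

```python
class Agent:
    """One agent per resource type; it checks every resource of that type."""

    def __init__(self, resource_type, probe):
        self.resource_type = resource_type
        self.probe = probe          # callable: resource name -> True if healthy
        self.resources = []

    def monitor_cycle(self):
        """Run one monitoring pass and return the status to report to the engine."""
        return {
            name: ("ONLINE" if self.probe(name) else "OFFLINE")
            for name in self.resources
        }


# A single IP agent looks after all IP resources (names are hypothetical).
ip_agent = Agent("IP", probe=lambda name: name != "ip_backup")
ip_agent.resources += ["ip_app", "ip_backup"]
status = ip_agent.monitor_cycle()
```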

4.  Cluster Communications and VCS Daemons

Cluster communications ensure that VCS is continuously aware of the status of each system’s service groups and resources. They also enable VCS to recognize which systems are active members of the cluster, which have joined or left the cluster, and which have failed.

4.1. High availability daemon (HAD)

The VCS high availability daemon (HAD) runs on each system. Also known as the VCS engine, HAD is responsible for:

    • building the running cluster configuration from the configuration files
    • distributing the information when new nodes join the cluster
    • responding to operator input
    • taking corrective action when something fails.

The engine uses agents to monitor and manage resources. It collects information about resource states from the agents on the local system and forwards it to all cluster members. The local engine also receives information from the other cluster members to update its view of the cluster.

The hashadow process monitors HAD and restarts it when required.

4.2.  HostMonitor daemon

VCS also starts HostMonitor daemon when the VCS engine comes up. The VCS engine creates a VCS resource VCShm of type HostMonitor and a VCShmg service group. The VCS engine does not add these objects to the main.cf file. Do not modify or delete these components of VCS. VCS uses the HostMonitor daemon to monitor the resource utilization of CPU and Swap. VCS reports to the engine log if the resources cross the threshold limits that are defined for the resources.

4.3.  Group Membership Services/Atomic Broadcast (GAB)

The Group Membership Services/Atomic Broadcast protocol (GAB) is responsible for cluster membership and cluster communications.

  • Cluster Membership

GAB maintains cluster membership by receiving input on the status of the heartbeat from each node by LLT. When a system no longer receives heartbeats from a peer, it marks the peer as DOWN and excludes the peer from the cluster. In VCS, memberships are sets of systems participating in the cluster.

  • Cluster Communications

GAB’s second function is reliable cluster communications. GAB provides guaranteed delivery of point-to-point and broadcast messages to all nodes. The VCS engine uses a private IOCTL (provided by GAB) to tell GAB that it is alive.
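The membership idea described above (a peer that stops sending heartbeats is dropped from the cluster) can be sketched in a few lines of Python. The timestamps and the timeout value are made-up numbers for illustration, and the function is my own simplification, not the GAB protocol.

```python
def membership(last_heartbeat, now, timeout):
    """Return the set of nodes whose last heartbeat is recent enough."""
    return {
        node
        for node, seen_at in last_heartbeat.items()
        if now - seen_at <= timeout
    }


# node3 has been silent for 20 seconds, so it is treated as DOWN.
last_heartbeat = {"node1": 100.0, "node2": 99.5, "node3": 80.0}
members = membership(last_heartbeat, now=100.0, timeout=5.0)
```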

4.4. Low Latency Transport (LLT)

VCS uses private network communications between cluster nodes for cluster maintenance. Symantec recommends two independent networks between all cluster nodes. These networks provide the required redundancy in the communication path and enable VCS to discriminate between a network failure and a system failure. LLT has two major functions.

  • Traffic Distribution

LLT distributes (load balances) internode communication across all available private network links. This distribution means that all cluster communications are evenly distributed across all private network links (maximum eight) for performance and fault resilience. If a link fails, traffic is redirected to the remaining links.

  • Heartbeat

LLT is responsible for sending and receiving heartbeat traffic over network links. The Group Membership Services function of GAB uses this heartbeat to determine cluster membership.
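Traffic distribution with redirection on link failure can be pictured as a simple round-robin over whichever links are currently up. The Python sketch below is an assumption-laden illustration (the link names and the round-robin policy are mine); it is not a description of LLT's actual scheduling.

```python
from itertools import cycle


def distribute(messages, link_up):
    """Assign each message to a private link, skipping links that are down."""
    active = [link for link, up in link_up.items() if up]
    if not active:
        raise RuntimeError("no private network link available")
    next_link = cycle(active)
    return [(message, next(next_link)) for message in messages]


both_up = distribute(["m1", "m2", "m3"], {"link1": True, "link2": True})
one_down = distribute(["m1", "m2", "m3"], {"link1": True, "link2": False})
```

With both links up the messages alternate between them; with one link down everything travels over the remaining link.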

4.5. I/O fencing module

The I/O fencing module implements a quorum-type functionality to ensure that only one cluster survives a split of the private network. I/O fencing also provides the ability to perform SCSI-3 persistent reservations on failover. The shared disk groups offer complete protection against data corruption by nodes that are assumed to be excluded from cluster membership.
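The "only one side survives" idea can be sketched as a majority vote over an odd number of shared coordination points. This is my own simplification to show the quorum principle, not the fencing module's actual algorithm; the function name and the sub-cluster labels are hypothetical.

```python
from collections import Counter


def surviving_subcluster(point_winners):
    """point_winners lists which sub-cluster won each coordination point.

    Only a sub-cluster holding a strict majority is allowed to survive.
    """
    counts = Counter(point_winners)
    winner, votes = counts.most_common(1)[0]
    if votes > len(point_winners) / 2:
        return winner
    return None                     # no majority: nobody may continue


survivor = surviving_subcluster(["A", "A", "B"])   # "A" holds 2 of 3
tie = surviving_subcluster(["A", "B"])             # no strict majority
```

Using an odd number of coordination points is what makes a tie impossible when every point is won by one of two sides.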

5. VCS Configuration files.

5.1. main.cf

/etc/VRTSvcs/conf/config/main.cf is the key file in terms of VCS configuration. The “main.cf” file basically gives the following information to the VCS agents and VCS daemons:

  • Which nodes are available in the cluster?
  • Which service groups are configured for each node?
  • Which resources are available in each service group, and what are their types and attributes?
  • Which dependencies does each resource have on other resources?
  • Which dependencies does each service group have on other service groups?
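To show how these five questions fit together, here is a small Python model of the same information, plus a consistency check. This is not main.cf syntax; the dictionary layout, the node, group and resource names, and the `config_errors` function are all invented for illustration.

```python
CLUSTER = {
    "systems": ["node1", "node2"],
    "groups": {
        "app_sg": {
            "systems": ["node1", "node2"],
            "resources": {               # resource name -> resource type
                "app_dg": "DiskGroup",
                "app_mnt": "Mount",
                "app_nic": "NIC",
                "app_ip": "IP",
            },
            # (parent, child): the parent depends on the child
            "resource_deps": [("app_mnt", "app_dg"), ("app_ip", "app_nic")],
        }
    },
    "group_deps": [],                    # (parent group, child group)
}


def config_errors(cluster):
    """Report references to systems, resources or groups that are not defined."""
    errors = []
    for group_name, group in cluster["groups"].items():
        for system in group["systems"]:
            if system not in cluster["systems"]:
                errors.append(f"{group_name}: unknown system {system}")
        for pair in group["resource_deps"]:
            for resource in pair:
                if resource not in group["resources"]:
                    errors.append(f"{group_name}: unknown resource {resource}")
    for pair in cluster["group_deps"]:
        for group_name in pair:
            if group_name not in cluster["groups"]:
                errors.append(f"unknown service group {group_name}")
    return errors
```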

5.2. types.cf

The file types.cf, which is listed in the include statement in the main.cf file, defines the VCS bundled types for VCS resources. The file types.cf is also located in the folder /etc/VRTSvcs/conf/config.

5.3. Other Important files

  • /etc/llthosts—lists all the nodes in the cluster
  • /etc/llttab—describes the local system’s private network links to the other nodes in the cluster
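As an illustration of what /etc/llthosts contains, the Python sketch below parses text in the usual "node ID, then host name" layout into a dictionary. The sample content and the `parse_llthosts` name are assumptions for the example; check the file on your own cluster before relying on the layout.

```python
def parse_llthosts(text):
    """Map node IDs to host names from lines such as '0 node1'."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        node_id, host_name = line.split()[:2]
        nodes[int(node_id)] = host_name
    return nodes


sample = "0 node1\n1 node2\n"             # made-up file content
nodes = parse_llthosts(sample)
```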

Sample VCS Setup

The figure below shows a sample VCS setup configured for an application running with a database and shared storage.

Why do we need shared storage for clusters?

Normally, database servers are configured to store their databases on SAN storage, and the database must be reachable from all the other nodes in the cluster in order to fail it over from one node to another. That is the reason both nodes in the figure below are configured with common shared SAN storage. In this model all the cluster nodes can see the storage devices from their local operating systems, but only one node (the active one) can write to the storage at a time.

Why does each server need two storage paths (connected to two HBAs)?

To provide redundancy for the server’s storage connection and to avoid a single point of failure in it. Whenever you notice multiple storage paths connected to a server, you can safely assume that some storage multipath software is running on the operating system, e.g. multipathd, EMC PowerPath, HDLM, MPIO, etc.

Why does each server need two network connections to the physical network?

This is, again, to provide redundancy for the server’s network connection and to avoid a single point of failure in its physical network connectivity. Whenever you see a dual physical network connection, you can assume that the server is using some kind of IP multipath software to manage the dual path, e.g. IPMP in Solaris, NIC bonding in Linux, etc.

Why do we need a minimum of two heartbeat connections between the cluster nodes?

When VCS loses all of its heartbeat connections except the last one, the condition is called cluster jeopardy. When the cluster is in the jeopardy state, either of the things below could happen:

1) The loss of the last available interconnect link
In this case, the cluster cannot reliably tell whether the last interconnect link is lost or the system itself is lost, and hence the cluster forms a network partition, producing two or more mini clusters depending on the actual partition. At this point, every service group that is not online on its own mini cluster, but may be online on the other mini cluster, is marked as “autodisabled” for that mini cluster until the interconnect links start communicating normally again.

2) The loss of an existing system that is currently in the jeopardy state due to a problem
In this case, the situation is exactly the same as in case 1, forming two or more mini clusters.

If both LLT interconnect links disconnect at the same time and no low-pri links are configured, the cluster cannot reliably tell that it is the interconnects that have disconnected, and it assumes that the other system is down and unavailable. In this scenario the cluster treats the event as a system fault, and each mini cluster attempts to bring the service groups online depending on the system StartupList defined on each service group. This may lead to data corruption, because applications write to the same underlying data on storage from different systems at the same time. This scenario is well known as the “split brain condition”.
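The reasoning above boils down to how many heartbeat links are still working. The tiny Python function below restates it; the function name and the wording of the returned states are mine, chosen only to summarise the text.

```python
def heartbeat_state(links_up):
    """Classify the cluster's view of a peer from the number of working links."""
    if links_up >= 2:
        return "normal"
    if links_up == 1:
        return "jeopardy"
    # With no link left, a dead peer and a dead network look identical.
    return "ambiguous: link failure or system failure"


states = [heartbeat_state(n) for n in (2, 1, 0)]
```

This is why a second heartbeat link matters: losing one of two links moves the cluster into jeopardy, a warning state, instead of straight into the ambiguous case.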

Typical VCS Setup for an application with Database

 

That is all for this introduction to VCS. Please stay tuned for the next posts, where I am going to discuss the actual administration of VCS.

 

Please don’t forget to drop your comments and inputs in the comment section.

Happy System Administration!

 

 

 

Ramdev


I started unixadminschool.com (aka gurkulindia.com) in 2009 as my own personal reference blog, and some time later I realized that my learnings might be helpful for other Unix admins if I managed my knowledge base in a more user-friendly format. The result is today’s unixadminschool.com. You can connect with me at https://www.linkedin.com/in/unixadminschool/

49 Responses

  1. chris says:

    not a bad rundown, for a beginner non jargon rundown I would also include some gotcha points. (i.e. (and IMO) at least one of your heartbeats should be a crossover cable between two nodes and a full chain of them in a 2+n node solution.

    Also, I personally find that the command line is easiest for performing cumbersome tasks as well as performing troubleshooting (although I always start new people on the gui, it’s far easier to see the “big picture” there). Things even a beginner should know are commands like hastatus and it’s equivalents for various versions and functions, a lot of information is available for very little typing.

    lastly a jargon==realworld dictionary might be helpful, it’s been my experience that people new to VCS tend to get overwhelmed by acronyms that they have never needed to know before as well as custom terms that really did not need to be invented for reasons other than obfuscation (I’m lookin at you Plex..)

    meh, just my $.02

    -C

    • Gurkulindia Gurkulindia says:

      Hello Chris, Thanks for your comments. From your impressive profile, I have noticed that you already mastered in the art of teaching,.

      learning minds always having three challenges … learning Procedure, actual technology and understanding it’s right usage. you might have already noticed that I am trying to simplify the first and last challenges, just to give those minds enough strength to break the middle one on it’s own
      .
      Please Keep visit to the site, and share your inputs for better posts.. Thanks again.

  2. Chris2 says:

    Yes Chris,

    You are wright about command line being the easiest way to see the picture/administer the VCS. Let’s not forget about the hastatus -summary (which as you already know presents just a summary output, in order to quick identify the VCS group having problems). Also halog -info = displays the agent log location, log that can also be used to catch better picture of what’s going on…
    Cheers,
    Chris

  3. venu says:

    Hi,

    Its really good and greate affert with VCS information, can you post about actual administration of VCS, i am waiting with that.

    thanks venukumar

  4. cswray says:

    Heartbeat crossovers can cause issues for heartbeat. They are a great idea but I’ve seen where they can cause issues. A switch cable setting for the crossover, I do believe, is the Veritas recommendation.

    • Gurkulindia Gurkulindia says:

      @cswray – you are correct and switch cable heartbeat is mandatory if we think about global clusters which are located in different places. I have indicated heartbeats in th post just to make the concept simple.

  5. Bas says:

    Hi Ramdev,

    Good Article, its very simple and easy to understand for beginners. Can you provide hierarchy info of the protocol HAD-GAB-LLT. waiting for next post administrations of VCS.

  6. ganesh says:

    hi, this is very great to me to understand , please include real time issues also.

    we are friends are waiting for ur new post.

    thanks a lot.

  7. Work to encourage … Go ahead! Who is approaching for the first time in VERITAS Cluster Server (VCS) can also be helpful to know the fundamentals and some concepts of VERITAS Volume Manager (VxVM). Here you will find a course free of Symantec “Veritas Storage Foundation 4.x for UNIX”[1]. Here [2] my VERITAS Quick Reference recap.

    1.
    2.

  8. anuj says:

    I installed VCS 5.1 on Solaris10-x86 running on VmWare .For Hearbeat i added 2 Virtual NIC on two nodes.Now Total i have 3 NIC on both nodes.

    It installed successfully , my two nodes shows running state .

    But facing one issue – not able to configre heartbeat.When i fire the command #hahb -display

  9. laotsao says:

    two observations
    1
    oracle rac is an application of oracle clusterware for oracle db, not by itself cluster software
    2
    comment on the crossover cable, is for 10/100 mbit, ge use stright cables, oracle rac require ethernet switches and does not support direct cabling between two nodes

  10. ranjith says:

    Hi …

    I installed VCS 5.1 on my vertual machines in my laptop and I am able to see the nodes are running fine from both the nodes(hastatus -summ)…Could any please let me know what are all the versions of VxVM can be used for VCS 5.1 x86..?

    Thanks in Advance..
    Ranjith

  11. Ramdev Ramdev says:

    @Ranjith, I dont think you have any dependency between VCS and VxVM. You are free to go any available version of VxVM. VCS deals with only volumes, and dont care aboout volume manager.

  12. Yogesh Raheja says:

    @Ranjit, Ram is absolutely right you have complete freedom to use any of the vxvm versions. But as per standards latest version is preferred. As we all are aware that there are so many modification/new feature introduced in vxvm4.0 onwards so vxvm5.0 should be preferred over 4.0 and below versions. Its only standard but not recommendation.:)

  13. Manjunath L says:

    Hi Yogesh/Ramdev, This is very good and easy to understand concept you have provied. I started going through this website regularly. Overall this is very good knowledge site for any Unix Solaris System Admin. Appreciate your effort. :)

  14. Yogesh Raheja says:

    @Manjunath, thank you very much for your kind words and interest in our gurkulindia.

  15. Mahendran says:

    Actually i installed VCS in my Active/Standby servers. It is showing that two heartbeat links are down now,, I do not know about this down in links,,, Please help me..

  16. HariBhakta says:

    It was very well explained and even the first timer who does not know anything about cluster can very well follow this document and get live experience on this.

  17. Rama says:

    This is very nice to see all at once for beginners. We really appreciate your hardwork and dedication.

  18. Yogesh Raheja says:

    @Rama, thank you very much for your interest and comment..

  19. sekhar says:

    hi anuj,

    i have seen ur comment, am looking for vcs 5.1 to install my x86 system in my laptop,

    please send the link.

  20. Rahul says:

    Hi Team

    This was very nice reading when went though it would provide me base start to work in VCS. Waiting your next input how to administrator VCS but still dont have any idea what is VXVM .
    Could you please provide some basic concept like to you given with VCS ?

    Thanks for sharing you knowledge
    Rahul Singh

  21. Santanu says:

    This lesson has been clearly written with easy words. Really it is good to build clear concept in VCS.

  22. Ramdev Ramdev says:

    Santanu – Thanks for the feedback. 

  23. Syed Rahman says:

    Hi

    This is very nice reading when I went though, it would provide me idea to start work in VCS. Could you please provide some basic concept about SUN Cluster 3.2
    Thanks for sharing you knowledge. Syed Rahman

    • Ramdev Ramdev says:

      Hi Syed, Thanks for the comment. Time is our constraint for now. We will start post on various cluster concepts sometime in january, until then our main focus on operating systems.sorry about that.

  24. nageswararao says:

    hi
    this is very good in giving the concept to the beginners

  25. Yogesh Raheja says:

    @Nageswararao, thanks for your feedback!..

  26. Harikrishna says:

    Nice explaination… and good for VCS beginners.. Thank you Ramdev..!!!

  27. hari says:

    good xplanation,I understood overview of clustere,can u post some scenarios about clusteres

  28. Dhanabal M says:

    Very useful information for beginners like me. :) Thanks a lot :)

  29. snehal says:

    Hi,

    Do anyone having some Docs about how to configure VCS cluster in AIX environment.

  30. saty says:

    very good explained…

  31. Musab says:

    Very clear and neat explanation, atleast now I got my concepts in VCS, thanks

  32. Anil says:

    very help full in understanding the VCS concepts. Thanks again.

    As Request is if you could give some practical examples in building the VCS will help us to understand practical knowledge as well

  33. theja says:

    thank you very much…i’m newbie…but understood the concept clearly in single reading….waiting for real admin VCS :) my best wishes :)

  34. vendhan says:

    This artcile really helped me for the interview.. Thanks buddy.

  35. amit says:

    Hey….his is very useful to understand the basic idea about Cluster and VCS….thanks a lot for sharing this knowledge stuff…

  36. Saran says:

    Hello Ramdev,

    Your blog is very informative. Much appreciate the time and efforts you have invested for this. Thank you very much indeed!

  37. Hello Ramdev,

    your explanation very nice .Please provide the actual configuration of vcs cluster .

    Thanks..
    Sivanjireddy.I

  38. Vinayak says:

    Its awesome article which clears the all basic concepts of VCS. Suberbbb article.
    Thanks

  39. Sunny says:

    Nice article for intro to VCS

  40. Amar says:

    Its really awesome article which clear the all basic concepts of VCS & also the images & example is very simple to understand.
    Thanks for sharing such articles.

