Best JSS Fail Over Practice

Cem
Valued Contributor

Hi Guys,

What is the best practice for JSS Fail Over? We are looking for an automated failover solution. Any suggestions welcome.

Cheers

Cem

-----------------------------------------

31 REPLIES

donmontalvo
Esteemed Contributor III

Hi Cem,

I set up a High Availability (HA) configuration for one of our Santa Fe customers a few years ago. I was able to set up two Xserves running 10.4, with heartbeat failover. We had a fiber attached RAID that needed to be scripted to switch to the failover server as well, but I'm pretty sure if this is for JSS it's not needed. Here's a link to the 10.4 HA doc, I'm sure you'll find updated docs if you surf around a bit.

http://manuals.info.apple.com/en/High_Availability_Admin_v10.4.pdf

PS, not sure if my 10.4 notes would help, happy to share if you contact me offlist.
PS2, feature request for JAMF...support for running JSS on Windows (so our datacenter folks can support hardware, OS and backups)

Don

--
https://donmontalvo.com

Bukira
Contributor

Hi,

What kind of failover? DP or the web service?

I raised this with JAMF as a feature request, as there is no built-in failover redundancy apart from a backup and restore of the database, which takes user interaction.

Regards

Criss

Criss Myers
Senior Customer Support Analyst (Mac Services)
iPhone Developer
Apple Certified Technical Coordinator v10.5
LIS Development Team
Adelphi Building AB28
University of Central Lancashire
Preston PR1 2HE
Ex 5054
01772 895054

Cem
Valued Contributor

Hi Criss,

We are looking for a complete, easy way of replacing the server box if it packs up. This will be done by a team that has no Mac knowledge, so it needs to be a complete swap-out or just an automated takeover by the second server.

Cem

tlarkin
Honored Contributor

The problem with JSS failover is that you would have to have a second server, with the same IP/DNS because the client will not know where to look. So I guess you can have a second box set up and powered off with the same IP/DNS and if the primary JSS fails you can boot up the fail over and let it take over.

Another option would be to mass edit the /etc/jamf.conf file which stores the JSS info locally via ARD Admin or something else.

-Tom

Cem
Valued Contributor

The first solution still needs tweaking, as the database is not synced.

The second solution requires over 540 clients to be accessed via ARD.

...considering non-Mac guys need to handle this action, both are a bit
involved.

But there could be a 3rd solution:

What if I put the second server in Target Disk Mode via FireWire and run Carbon
Copy Cloner on a daily basis (around 3am, after the database backup)? This way
it will be an exact copy of the main server. If the server fails, just switch
it off and switch on the second server. Just a thought...what do you think?

Cem

stevewood
Honored Contributor II

I don't know about anyone else, but that seems like a lot of heavy drive
work. I know MTBF on a hard drive is up there, but wiping the drive every
night and putting a complete clone on the system seems like overkill.

Why not use DNS instead to handle the IP address? A quick edit of a DNS
record will generally take only a few minutes. If the TTL on your DNS
server is at 1 hr for internal records, the most you'd be down is that long.

Setup a script to sync the database backup over to the backup JSS box with a
launchd task set to shutdown mySQL, restore the database, and bring mySQL
back up on the backup box.

Failure of the primary JSS happens, edit the DNS records and you are back up
and running.
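
For anyone going the DNS route, here is a minimal sketch for checking where the JSS name currently points and how long clients may cache that answer. It assumes dig is installed and that the JSS is published as an A record. The host name in the example is a placeholder, not something from this thread.

```shell
#!/bin/bash
# Sketch: show the address and TTL behind a JSS host name.

# Read "dig +noall +answer" output on stdin and print
# "<address> <ttl>" for every A record in it.
parse_a_records() {
    awk '$3 == "IN" && $4 == "A" { print $5, $2 }'
}

# Ask DNS for the A record of the given name (needs dig and a reachable DNS server).
jss_dns_status() {
    dig +noall +answer "$1" A | parse_a_records
}

# Example (hypothetical host name):
#   jss_dns_status jss.mycompany.com
```

The TTL printed is roughly the longest a client that has already looked the name up could keep using the old address after the record is edited.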

Just my two pennies.

Steve Wood
Director of IT
swood at integer.com

The Integer Group | 1999 Bryan St. | Ste. 1700 | Dallas, TX 75201
T 214.758.6813 | F 214.758.6901 | C 940.312.2475

Cem
Valued Contributor

That sounds good already.

Setup a script to sync the database backup over to the backup JSS box with a launchd task set to shutdown mySQL, restore the database, and bring mySQL back up on the backup box.

Only thing is, I am not a script head. Steve, do you have any idea where I can get hold of such a script?

Also I would like to ask a question to Thomas Larkin; have you got a website for scripts that you are kindly sharing? I may need the one for partitioning the HDD.

Thanks in advance

Cem

stevewood
Honored Contributor II

To be honest, I'm not a big mySQL head, so I'm not 100% certain how to
accomplish this, but something like:

#########

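#!/bin/bash

# copy over the backup from primary server (needs shared SSH keys)
scp primaryjss.mycompany.com:/some/file/path/jsssqlbackup.gz /my/local/path/to/sql/

# unzip backup, overwriting the copy from the previous run
gunzip -f /my/local/path/to/sql/jsssqlbackup.gz

# restore the backup. mysqld has to be running to load a dump, and
# mysqladmin has no "start" command, so mySQL is left up for this step.
# Assumes the backup is a plain SQL dump of a database named jamfsoftware;
# add -u and -p options for your mySQL user.
/usr/local/mysql/bin/mysql jamfsoftware < /my/local/path/to/sql/jsssqlbackup

exit 0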

########

Again, not being a mySQL geek I'm not sure it would work, but this concept
should work. And of course, there is no error checking in that, and you'd
probably need to setup shared SSH keys so that the scp command would work.

Not sure if this method will work for sharing the ssh keys, but here is an
article that is up on AFP548 from a few years ago:

http://www.afp548.com/article.php?story040816224717742&query=ssh%2Bkeys
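
As a rough idea of what the SSH key setup usually involves, here is a minimal sketch for the backup box. The key path, user name and host name are placeholders and would need changing for a real setup.

```shell
#!/bin/bash
# Sketch: passwordless SSH key so the backup JSS box can scp from the primary.
# KEY and PRIMARY are hypothetical placeholders.

KEY="${KEY:-$HOME/.ssh/jss_sync_key}"
PRIMARY="${PRIMARY:-admin@primaryjss.mycompany.com}"

# Create the key pair with an empty passphrase, unless it already exists.
make_key() {
    mkdir -p "$(dirname "$KEY")" &&
        { [ -f "$KEY" ] || ssh-keygen -q -t rsa -N "" -f "$KEY"; }
}

# Append the public key to authorized_keys on the primary.
# This asks for the account password once; after that the key is used.
install_key() {
    ssh "$PRIMARY" 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys' < "$KEY.pub"
}

# Typical order: make_key, then install_key, then test with
#   scp -i "$KEY" "$PRIMARY":/some/file/path/jsssqlbackup.gz /my/local/path/to/sql/
```

A key without a passphrase lets anyone who can read the key file log in, so it is worth restricting the account it belongs to.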

I'm sure someone else on list might be able to clean it up some.

As for the partitioning scripts that Thomas has, there was another thread
going that I sent those scripts out to. Check the list archive from today.

Steve Wood
Director of IT
swood at integer.com

The Integer Group | 1999 Bryan St. | Ste. 1700 | Dallas, TX 75201
T 214.758.6813 | F 214.758.6901 | C 940.312.2475

Cem
Valued Contributor

Thanks Don,

Actually, Failover IP will be the best solution, with incremental backup
using rsync or CCC. (I wonder if CCC will work.)
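
A minimal sketch of the rsync side of that idea, run on the master. It assumes the JSS database backups are written to a folder there and that SSH keys are already in place. Both paths and the host name are placeholders.

```shell
#!/bin/bash
# Sketch: incremental copy of the JSS database backups to the backup server.
# SRC and DEST are hypothetical; the trailing slash on SRC means
# "copy the contents of this folder".

SRC="${SRC:-/path/to/jss/backups/}"
DEST="${DEST:-backupjss.mycompany.com:/path/to/jss/backups/}"

sync_backups() {
    # -a keeps permissions and timestamps, -z compresses in transit,
    # --delete removes files on the destination that are gone from the source.
    rsync -az --delete "$SRC" "$DEST"
}

# Example: call sync_backups from a nightly job, after the database backup has finished.
```

Only files that changed since the last run are transferred, which is far lighter than cloning the whole drive every night.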

Anyone out there tried this with JSS Server?

Here is the latest doc;
http://manuals.info.apple.com/en_US/File_Services_Admin_v10.5.pdf

It doesn't mention anything in 10.6 Server Manual, but my Apple contact
confirmed nothing has been changed.

Cem

donmontalvo
Esteemed Contributor III

Hi Cem,

A belated thank you for the 10.5 doc. I was pulling my hair out looking for a 10.5 version of HA, looks like I was looking in the wrong place! I'll need to refresh myself with this, since I'll need to do this in the coming weeks for one environment.

Don

--
https://donmontalvo.com

Cem
Valued Contributor

Hi,

I have spent half of the day trying to get this IP Failover working for my
new Intel Mac OS X 10.6.3 Servers...but no luck.

It takes over and sends the notification email correctly. But when the
primary server is up and running again, the backup server sees the running
primary server and still tries to get its IP. This puts the Primary
Server off the network with the error message "Another device on the network
is using your computer's IP address".

Am I doing something wrong? I have followed both the Command Line Admin and
File Services Admin manuals step by step. I also asked my best friend
Google...but no luck.

I am sure there is a hero somewhere to help.

Cheers
Cem

donmontalvo
Esteemed Contributor III

When we worked with an Apple SE to get HA in place, there was little documentation (10.4 days). We were successful in getting Fail Over to work on the server and Fibre switched RAID, but I do remember the same frustration. Fail Back was a manual procedure that had to be done off hours. Have you engaged Apple to see what they recommend?

Don

--
https://donmontalvo.com

Cem
Valued Contributor

Apple told me this is expected behaviour.

If I have to do it manually, this is what I am thinking;

Master Server fails

-->Backup Server takes IP over automatically (because Casper database
already rsynced it is up to date and service won't be interrupted)

-->Master Server repaired or recovered, but cannot use its Unique IP

-->Remove IP Failover info from /etc/hostconfig from Backup Server -->reboot

-->Boot or reboot Master Server (after repair or recovery) IP is now useable

-->Add the info back into /etc/hostconfig file on Backup Server-->reboot

-->IP Failover is now ready again
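
The remove and add-back steps above could be sketched roughly as below, run on the Backup Server. This assumes the IP Failover info is the set of lines in /etc/hostconfig that start with FAILOVER_, so check your own file before relying on it. Commenting the lines out is used here in place of deleting them, so they can be restored later.

```shell
#!/bin/bash
# Sketch: switch the IP Failover entries in hostconfig off and back on.
# Assumption: the entries are the lines beginning with FAILOVER_.

HOSTCONFIG="${HOSTCONFIG:-/etc/hostconfig}"

# Comment out every FAILOVER_ line, keeping a .bak copy of the original file.
disable_failover() {
    cp -p "$HOSTCONFIG" "$HOSTCONFIG.bak" &&
        sed 's/^FAILOVER_/#FAILOVER_/' "$HOSTCONFIG.bak" > "$HOSTCONFIG"
}

# Remove the comment marker again once the Master is back on its IP.
enable_failover() {
    cp -p "$HOSTCONFIG" "$HOSTCONFIG.bak" &&
        sed 's/^#FAILOVER_/FAILOVER_/' "$HOSTCONFIG.bak" > "$HOSTCONFIG"
}

# Run as root on the Backup Server, e.g. disable_failover, then reboot.
```

Each call still needs the reboot from the steps above for the change to take effect.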

What do you think?

Cem

tlarkin
Honored Contributor

Do your clients connect to the JSS by IP or by FQDN? The only problem I see is, that if you do it by IP and the IPs are not the same, the client will not know where to check in.

donmontalvo
Esteemed Contributor III

Hi Cem,

I remember we only had to schedule off hours maintenance. Then we just powered down the Fail Over server, and powered back up the Master. Then the Fail Over server went back to sitting there with DHCP address, waiting for another failure so it can jump in again. I was happy to have convinced the business to fork over enough for Fibre connected RAID, otherwise the Fail Over wouldn't have been totally transparent. :)

AFP3 has a 120 second window for auto-reconnect, not sure how this plays into the way JSS uses AFP for Distribution Server stuff. Any mounts from the Master will attempt to reconnect for up to 120 seconds, which should be plenty of time for the Fail Over server to kick in.

Thanks,
Don

--
https://donmontalvo.com

donmontalvo
Esteemed Contributor III

Fail Over covers this, whether the clients connect by IP or DNS, the Fail Over grabs the IP, and thus the DNS.

Don

--
https://donmontalvo.com

tlarkin
Honored Contributor

So, when the fail-over machine boots it has the same IP/DNS as the Master JSS? The client keeps all that info in the /etc/jamf.conf file and if it doesn't match they won't check in. Which is something I'd like to see in the future of Casper, where you can list multiple servers and set priority.

donmontalvo
Esteemed Contributor III

Yep, the Fail Over server sits there with a DHCP, waits for heartbeat to stop, then grabs the IP and assumes the Master role. For our Santa Fe client, the COO actually pulled the plug on the Master (one of the IT guys nearly fainted) and the Fail Over took all of 3 seconds. One setting you'll want to disable is "Power up after power interruption" in Energy Saver. This way the Master stays offline until you're ready to bring it back up.
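
For reference, that Energy Saver setting can also be switched off from the command line. A minimal sketch, assuming pmset's autorestart option is the one behind that checkbox on your OS version (check with pmset -g first):

```shell
#!/bin/bash
# Sketch: stop the Master from powering itself back up after a power loss.
# Set DRY_RUN=1 to print the command instead of running it.

disable_auto_restart() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "pmset -a autorestart 0"
    else
        sudo pmset -a autorestart 0
    fi
}

# Example: disable_auto_restart   (run on the Master, asks for an admin password)
```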

Don

--
https://donmontalvo.com

tlarkin
Honored Contributor

I need to set this up for our fail over Open Directory Master, but I don't have an extra box lying around I can keep idle if the JSS goes down.

Cem
Valued Contributor

I strongly agree with disabling "Power up after power interruption", as the
only way to access your Master after failover is via LOM...if you have only
one IP of course...otherwise you will be calling your data centre.

Cem
Valued Contributor

I have been reading about this a lot and it is not recommended for OD, unless you are running home folders from the same server (also not recommended).
OD has Replica technology, which handles the failover scenario perfectly well.

On 01/06/2010 15:56, "Thomas Larkin" <tlarki at kckps.org> wrote:

I need to set this up for our fail over Open Directory Master, but I don't have an extra box lying around I can keep idle if the JSS goes down.

donmontalvo
Esteemed Contributor III

Hi Tom,

Do you need to do Fail Over on OD? Clients should reroute to the Replicas, no?

Don

--
https://donmontalvo.com

donmontalvo
Esteemed Contributor III

Hi Cem,

I see what you mean. If the Master comes back up after Fail Over, you may end up with two boxes fighting for the same IP address.

Don

--
https://donmontalvo.com

Cem
Valued Contributor

After failover, the Master server shows the error that its IP is being used
and it simply cannot connect to the network.
I have tried to access the Master via ARD or ssh, but I ended up connecting
to the Backup server. This happened using fixed IP and using DHCP on both
servers.

Cem

tlarkin
Honored Contributor

The problem is, home folders are on that server, so I need an exact copy of it. I wish we could migrate home folders to an XRAID, RAID 5 unit on fibre switches but there is no budget for that. If the ODM goes down and someone tries to log into a machine they have not logged into yet (since we do PHDs) they cannot log in because they cannot access their home folder.

Also, I have been lucky enough to never have to test the fail-over, but I don't think we have it set up here. We have 1 ODM 6 T1 replicas and 12 T2 replicas.

donmontalvo
Esteemed Contributor III

Hi Cem,

The Master can't connect because the Fail Over box has assumed the IP address. This is expected behavior, and why we schedule off hour maintenance. We can then shut down the Fail Over, bring up the Master, then when the Fail Over box comes back up, it resumes as DHCP.

Don

--
https://donmontalvo.com

Cem
Valued Contributor

In this scenario, does the Master Server have DHCP or a fixed IP? As far as my
environment is concerned, I cannot get an FQDN if I don't have a fixed IP address.

We can then shut down the Fail Over, bring up the Master, then when the Fail Over box comes back up, it resumes as DHCP.

If the Master has a fixed IP, the Failover Server will force the Master Server
off the network again after being rebooted. That is what I observed yesterday.

Sorry to repeat, but that is why I have come up with this plan of action:

Master Server fails

-->Backup Server takes IP over automatically (because Casper database
already rsynced it is up to date and service won't be interrupted)

-->Master Server repaired or recovered, but cannot use its Unique IP

-->Remove IP Failover info from /etc/hostconfig from Backup Server -->reboot

-->Boot or reboot Master Server (after repair or recovery) IP is now useable

-->Add the info back into /etc/hostconfig file on Backup Server-->reboot

-->IP Failover is now ready again

Cem

donmontalvo
Esteemed Contributor III

Hi Cem,

Once the Fail Over box is taken down, the Master would be brought back up with its original static IP address (nothing changes). Perform any maintenance needed on the Master, then bring the Fail Over box back up. The Fail Over would need to come back up as DHCP (don't remember if we had to do this last step manually).

Don

--
https://donmontalvo.com

donmontalvo
Esteemed Contributor III

Cem,

I dug up my notes, the Fail Over will in fact release the IP address once it is brought back up and notices the Master is up. I'll send you my notes in a separate email.

Don

--
https://donmontalvo.com

Cem
Valued Contributor

Hi,

OS X 10.6 Server IP Fail Over is broken, but I have managed to fix it. If
anyone is interested, here is how it's done:

I have created an Automator app with a Unix command and added it to the login
items of the Master Server. (Launchd (Lingon) didn't work for me, as
heartbeatd wasn't sending pulses.)
Here is the command:

heartbeatd -d fwIPAddress serverIPAddress
(example: heartbeatd -d 10.0.0.2 192.168.0.2 )

I now have fully automated IP Failover with mail notifications (it notifies
on failing and returning IPs). The Data Center guys just need to put the
Master server back on :)
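
If the command goes into a login item or script, a small wrapper can catch a mistyped address before heartbeatd starts. A minimal sketch built around the same command as above; the function names are made up for this example, and the two addresses are the ones from the example line.

```shell
#!/bin/bash
# Sketch: sanity-check both addresses, then start heartbeatd as shown above.

# Succeeds only if the argument looks like a dotted IPv4 address (each part 0-255).
is_ipv4() {
    local IFS=. part
    [[ "$1" =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]] || return 1
    for part in $1; do
        [ "$part" -le 255 ] || return 1
    done
}

# Usage: start_heartbeat fwIPAddress serverIPAddress
start_heartbeat() {
    is_ipv4 "$1" && is_ipv4 "$2" || {
        echo "usage: start_heartbeat fwIPAddress serverIPAddress" >&2
        return 1
    }
    heartbeatd -d "$1" "$2"
}

# Example: start_heartbeat 10.0.0.2 192.168.0.2
```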

This is what it says in Terminal: "-d, --debug as -x, but print debug output to terminal"

And more here on man page:
http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man8/heartbeatd.8.html

The next challenge is to get the Casper MySQL db syncing to the backup server,
so the JSS keeps running after failover. I will keep you all updated.

Cem

donmontalvo
Esteemed Contributor III

Now that we have our isolated LAB environment built here in Dallas, one of the things we're going to be testing is running MySQL (VM'd on our LAB Xserve for now) to see if we can offload the MySQL database from the Xserve. This would make it easier to get another box up and running. When we imported our database, the compressed database was 2.4G and took several hours to expand...and many more hours to import. We're hoping to never have to go through that process again. :)

Don

--
https://donmontalvo.com