
Hi Guys,



What is the best practice for JSS failover? We are looking for an automated failover solution. Any suggestions welcome.



Cheers



Cem



-----------------------------------------

Hi Cem,



I set up a High Availability (HA) configuration for one of our Santa
Fe customers a few years ago. I was able to set up two Xserves running
10.4, with heart beat failover. We had a fiber attached RAID that
needed to be scripted to switch to the failover server as well, but
I'm pretty sure if this is for JSS it's not needed. Here's a link to
the 10.4 HA doc, sure you'll find updated docs if you surf around a bit.



http://manuals.info.apple.com/en/High_Availability_Admin_v10.4.pdf



PS, not sure if my 10.4 notes would help, happy to share if you
contact me offlist.
PS2, feature request for JAMF...support for running JSS on Windows (so
our datacenter folks can support hardware, OS and backups)



Don


Hi,



What kind of failover? The DP or the web service?



I raised this with JAMF as a feature request, as there is no built-in failover redundancy apart from a backup and restore of the database, which requires user interaction.



Regards



Criss



Criss Myers
Senior Customer Support Analyst (Mac Services)
iPhone Developer
Apple Certified Technical Coordinator v10.5
LIS Development Team
Adelphi Building AB28
University of Central Lancashire
Preston PR1 2HE
Ex 5054
01772 895054


Hi Criss,



We are looking for a complete, easy way of replacing the server box if it packs up. This will be done by a team that has no Mac knowledge, so it needs to be a complete swap-out or an automated takeover by the second server.



Cem


The problem with JSS failover is that you would have to have a second server, with the same IP/DNS because the client will not know where to look. So I guess you can have a second box set up and powered off with the same IP/DNS and if the primary JSS fails you can boot up the fail over and let it take over.



Another option would be to mass edit the /etc/jamf.conf file which stores the JSS info locally via ARD Admin or something else.



-Tom


The first solution still needs tweaking, as the database is not synced.



The second solution requires over 540 clients to be accessed via ARD.



...considering non-Mac guys need to handle this action, both are a bit
involved.



But there could be a third solution:



What if I put the second server in Target Disk Mode via FireWire and run
Carbon Copy Cloner daily (around 3am, after the database backup)? This way it
will be an exact copy of the main server. If the main server fails, just
switch the second server off and on again. Just a thought...what do you think?



Cem


I don't know about anyone else, but that seems like a lot of heavy drive
work. I know MTBF on a hard drive is up there, but wiping the drive every
night and putting a complete clone on the system seems like overkill.



Why not use DNS instead to handle the IP address? A quick edit of a DNS
record will generally take only a few minutes. If the TTL on your DNS
server is at 1 hr for internal records, the most you'd be down is that long.
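
To put a number on that worst case, the TTL can be read straight off the record. This is only a rough sketch: the hostname in the usage comment is made up, and it assumes the usual `dig +noall +answer` layout where the TTL is the second field.

```shell
#!/bin/bash
# Print the TTL (in seconds) from one line of `dig +noall +answer` output.
# Answer lines look like: name  TTL  class  type  address
ttl_from_answer() {
    echo "$1" | awk '{ print $2 }'
}

# Worst-case minutes a client may keep using the old address,
# rounded up to the next whole minute.
max_stale_minutes() {
    echo $(( ($1 + 59) / 60 ))
}

# Example (hypothetical hostname; needs dig and a reachable DNS server):
#   line=$(dig +noall +answer jss.mycompany.com A | head -n 1)
#   max_stale_minutes "$(ttl_from_answer "$line")"
```

Note that a resolver reports the remaining TTL, so the value can be lower than what is configured on the record itself.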



Set up a script to sync the database backup over to the backup JSS box, with a
launchd task set to shut down MySQL, restore the database, and bring MySQL
back up on the backup box.
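
For the launchd side, something along these lines could generate the job definition. Treat it as a sketch: the label, script path, and schedule in the usage line are placeholders, not anything from a real setup. It only relies on the standard launchd keys Label, ProgramArguments, and StartCalendarInterval.

```shell
#!/bin/bash
# Write a LaunchDaemon plist that runs a script once a day.
# Arguments: label, script path, hour (0-23), minute (0-59), output file.
write_daily_plist() {
    cat > "$5" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>$1</string>
    <key>ProgramArguments</key>
    <array>
        <string>$2</string>
    </array>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>$3</integer>
        <key>Minute</key>
        <integer>$4</integer>
    </dict>
</dict>
</plist>
EOF
}

# Example (all names are placeholders):
#   write_daily_plist com.mycompany.jssrestore /usr/local/bin/jss_restore.sh 3 30 \
#       /Library/LaunchDaemons/com.mycompany.jssrestore.plist
#   launchctl load /Library/LaunchDaemons/com.mycompany.jssrestore.plist
```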



If the primary JSS fails, edit the DNS records and you are back up
and running.



Just my two pennies.



Steve Wood
Director of IT
swood at integer.com



The Integer Group | 1999 Bryan St. | Ste. 1700 | Dallas, TX 75201
T 214.758.6813 | F 214.758.6901 | C 940.312.2475


That sounds good already.



Setup a script to sync the database backup over to the backup JSS box with a launchd task set to shutdown mySQL, restore the database, and bring mySQL back up on the backup box.



The only thing is I am not a script head. Steve, do you have any idea where I can get hold of such a script?



Also, I would like to ask Thomas Larkin a question: do you have a website for the scripts that you are kindly sharing? I may need the one for partitioning the HDD.



Thanks in advance



Cem


To be honest, I'm not a big mySQL head, so I'm not 100% certain how to
accomplish this, but something like:



#########



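#!/bin/bash

# NOTE: hostnames and paths are placeholders. The database name
# (jamfsoftware), a mysqldump-style backup, and MySQL credentials
# supplied via a my.cnf file are all assumptions.

# shutdown mysql
/usr/local/mysql/bin/mysqladmin shutdown

# copy over the backup from primary server
scp primaryjss.mycompany.com:/some/file/path/jsssqlbackup.gz /my/local/path/to/sql/

# unzip backup
gunzip -f /my/local/path/to/sql/jsssqlbackup.gz

# restart mysql (mysqladmin has no "start" command)
/usr/local/mysql/support-files/mysql.server start

# restore the database from the unzipped dump (server must be running)
/usr/local/mysql/bin/mysql jamfsoftware < /my/local/path/to/sql/jsssqlbackup

exit 0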



########



Again, not being a MySQL geek I'm not sure it would work, but this concept
should work. And of course, there is no error checking in that, and you'd
probably need to set up shared SSH keys so that the scp command would work.
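
On the error checking, one cheap guard would be to refuse to restore from a backup that is missing, empty, or corrupt. A minimal sketch, assuming the backup is a gzip file as in the script above (the path in the usage comment is the same placeholder):

```shell
#!/bin/bash
# Return 0 only if the file exists, is not empty, and passes gzip's
# integrity test. Meant to run before anything touches the live database.
backup_is_usable() {
    [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}

# Example (path is a placeholder):
#   backup_is_usable /my/local/path/to/sql/jsssqlbackup.gz || exit 1
```

This only proves the archive is intact, not that the SQL inside it is a complete dump.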



Not sure if this method will work for sharing the ssh keys, but here is an
article that is up on AFP548 from a few years ago:



http://www.afp548.com/article.php?story040816224717742&query=ssh%2Bkeys



I'm sure someone else on list might be able to clean it up some.
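
For the shared keys, the general idea is to generate a key pair on the backup box and add its public half to authorized_keys on the primary. A rough sketch of the "add it once" part, with the key file name and hostname in the comments being placeholders:

```shell
#!/bin/bash
# Append a public key line to an authorized_keys file unless it is
# already there, so the function is safe to run more than once.
# Arguments: public key line, path to authorized_keys.
authorize_key() {
    touch "$2" && chmod 600 "$2"
    grep -qxF -- "$1" "$2" || echo "$1" >> "$2"
}

# Typical flow (hostnames and paths are placeholders):
#   on the backup box:   ssh-keygen -t rsa -N "" -f ~/.ssh/jss_sync
#   on the primary box:  authorize_key "$(cat jss_sync.pub)" ~/.ssh/authorized_keys
#   then, on the backup: scp -i ~/.ssh/jss_sync primaryjss.mycompany.com:/some/file/path/jsssqlbackup.gz .
```

A key with an empty passphrase is what lets an unattended launchd job use it, so the private key file needs to stay readable by that one account only.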



As for the partitioning scripts that Thomas has, there was another thread
going that I sent those scripts out to. Check the list archive from today.



Steve Wood


Thanks Don,



Actually, IP failover will be the best solution, with incremental backup
using rsync or CCC. (I wonder whether CCC will work.)



Anyone out there tried this with JSS Server?



Here is the latest doc;
http://manuals.info.apple.com/en_US/File_Services_Admin_v10.5.pdf



The 10.6 Server manual doesn't mention anything about it, but my Apple
contact confirmed nothing has changed.



Cem


Hi Cem,



A belated thank you for the 10.5 doc. I was pulling my hair out
looking for a 10.5 version of HA, looks like I was looking in the
wrong place! I'll need to refresh myself with this, since I'll need to
do this in the coming weeks for one environment.



Don


Hi,



I have spent half the day trying to get IP failover working on my
new Intel Mac OS X 10.6.3 servers...but no luck.



It takes over and sends the notification email correctly. But when the
primary server is up and running again, the backup server sees the running
primary server and still tries to take its IP. This knocks the primary
server off the network with the error message "Another device on the network
is using your computer's IP address".



Am I doing something wrong? I have followed both the Command Line Admin and
File Services Admin manuals step by step. I also asked my best friend
Google...but no luck.



I am sure there is a hero somewhere to help.



Cheers
Cem


When we worked with an Apple SE to get HA in place, there was little
documentation (10.4 days). We were successful in getting Fail Over to
work on the server and Fibre switched RAID, but I do remember the same
frustration. Fail Back was a manual procedure that had to be done off
hours. Have you engaged Apple to see what they recommend?



Don


Apple told me this is expected behaviour.



If I have to do it manually, this is what I am thinking;



Master Server fails



-->Backup Server takes the IP over automatically (because the Casper database
is already rsynced, it is up to date and service won't be interrupted)



-->Master Server repaired or recovered, but cannot use its Unique IP



-->Remove IP Failover info from /etc/hostconfig from Backup Server -->reboot



-->Boot or reboot Master Server (after repair or recovery) IP is now useable



-->Add the info back into /etc/hostconfig file on Backup Server-->reboot



-->IP Failover is now ready again



What do you think?
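
The two /etc/hostconfig steps in that procedure could be scripted roughly like this. It is a sketch that assumes the failover settings sit on lines starting with FAILOVER_ (check this against your own file first), and it comments them out rather than deleting them so they can be put back unchanged.

```shell
#!/bin/bash
# Comment out every FAILOVER_ line in a hostconfig-style file.
# A copy of the previous contents is kept next to it with a .bak suffix.
disable_failover() {
    sed -i.bak 's/^FAILOVER_/#FAILOVER_/' "$1"
}

# Undo disable_failover by removing the leading "#".
enable_failover() {
    sed -i.bak 's/^#FAILOVER_/FAILOVER_/' "$1"
}

# Example, run as root on the Backup Server, with a reboot after each step:
#   disable_failover /etc/hostconfig
#   enable_failover /etc/hostconfig
```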



Cem


Do your clients connect to the JSS by IP or by FQDN? The only problem I see is, that if you do it by IP and the IPs are not the same, the client will not know where to check in.


Hi Cem,



I remember we only had to schedule off hours maintenance. Then we just
powered down the Fail Over server, and powered back up the Master.
Then the Fail Over server went back to sitting there with DHCP
address, waiting for another failure so it can jump in again. I was
happy to have convinced the business to fork over enough for Fibre
connected RAID, otherwise the Fail Over wouldn't have been totally
transparent. :)



AFP3 has a 120 second window for auto-reconnect, not sure how this
plays into the way JSS uses AFP for Distribution Server stuff. Any
mounts from the Master will attempt to reconnect for up to 120
seconds, which should be plenty of time for the Fail Over server to
kick in.



Thanks,
Don


Fail Over covers this, whether the clients connect by IP or DNS, the
Fail Over grabs the IP, and thus the DNS.



Don


So, when the fail-over machine boots it has the same IP/DNS as the Master JSS? The client keeps all that info in the /etc/jamf.conf file and if it doesn't match they won't check in. Which is something I'd like to see in the future of Casper, where you can list multiple servers and set priority.


Yep, the Fail Over server sits there with a DHCP, waits for heartbeat
to stop, then grabs the IP and assumes the Master role. For our Santa
Fe client, the COO actually pulled the plug on the Master (one of the
IT guys nearly fainted) and the Fail Over took all of 3 seconds. One
setting you'll want to disable is "Power up after power interruption"
in Energy Saver. This way the Master stays offline until you're ready
to bring it back up.
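
For anyone wondering what "waits for heartbeat to stop" amounts to, here is a toy version of the idea. It is not how Apple's failover daemons are implemented, just a sketch of counting consecutive missed checks before declaring the Master dead; the address and thresholds in the usage line are made up.

```shell
#!/bin/bash
# Poll a health-check command and return once it has failed
# max_misses times in a row. Any success resets the count.
# Arguments: max_misses, seconds between checks, then the check command.
wait_for_failure() {
    max_misses=$1
    interval=$2
    shift 2
    misses=0
    while [ "$misses" -lt "$max_misses" ]; do
        if "$@" >/dev/null 2>&1; then
            misses=0
        else
            misses=$((misses + 1))
        fi
        sleep "$interval"
    done
    return 0
}

# Example (placeholder address): declare the Master down after
# three failed pings in a row, five seconds apart.
#   wait_for_failure 3 5 ping -c 1 10.0.0.10 && echo "Master is down"
```

Requiring several misses in a row is what keeps a single dropped packet from triggering a takeover.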



Don


I need to set this up for our fail over Open Directory Master, but I don't have an extra box lying around I can keep idle if the JSS goes down.


I strongly agree with disabling "Power up after power interruption", as
LOM is the only way to access your Master after failover...if you have only
one IP, of course...otherwise you will be calling your data centre.
c


I have been reading about this a lot, and it is not recommended for OD, unless you are running home folders from the same server (also not recommended).
OD has replica technology that delivers the failover scenario perfectly.



On 01/06/2010 15:56, "Thomas Larkin" <tlarki at kckps.org> wrote:



I need to set this up for our fail over Open Directory Master, but I don't have an extra box lying around I can keep idle if the JSS goes down.


Hi Tom,



Do you need to do Fail Over on OD? Clients should reroute to the
Replicas, no?



Don


Hi Cem,



I see what you mean. If the Master comes back up after Fail Over, you
may end up with two boxes fighting for the same IP address.



Don





After failover, the Master server shows the error that its IP is being used,
and it simply cannot connect to the network.
I have tried to access the Master via ARD or ssh, but I ended up connecting
to the Backup server. This happened using fixed IPs and using DHCP on both
servers.



Cem


The problem is, home folders are on that server, so I need an exact copy of it. I wish we could migrate home folders to an XRAID, RAID 5 unit on fibre switches but there is no budget for that. If the ODM goes down and someone tries to log into a machine they have not logged into yet (since we do PHDs) they cannot log in because they cannot access their home folder.



Also, I have been lucky enough to never have to test the fail-over, but I don't think we have it set up here. We have 1 ODM, 6 T1 replicas, and 12 T2 replicas.