Yo people, be careful: what follows falls squarely into "don't try this at home, kids" territory.
My RAID1 array failed. Long story short: I have a crappy, expensive, full-featured ASUS motherboard that is 13 months old and came with a 12-month warranty. Now stuff is imploding from everywhere; this piece of hardware has been to the workshop three times already: PSU failure & PCI Express port failures.
Three days ago, the computer booted up with only 1 out of 3 hard drives (the last one), and only into a spare root Slackware recovery partition that I luckily had lying around on this third drive. I am assuming SATA port failures this time. Why not, it fits the picture.
My drive layout is as follows:
SIZES        Drive SDA    Drive SDB               Drive SDC
sdX1 100GB   DATA SPARE   CRITICAL DATA in RAID1  CRITICAL DATA in RAID1
sdX2 1GB     SWAP 1       SWAP 2                  SWAP 3
sdX3 19GB    Debian       Fedora                  Slackware
sdX4 30GB    Slack Home   Debian Home             Fedora Home
The distro I actually use is Fedora 7; the Debian is an old Ubuntu-something that I haven't fired up in ages, and the Slackware is my trusty recovery tool (faster than a live disc).
I am not using /home much; everything lives on the sdb1/sdc1 RAID1 array so it is available from all distros. It's the place for all pics, music and movies, for backups of my work flash drives plus mail; it also holds the main distros' .iso files and some more K3b .iso's from work that I cannot carry on my flash drive.
As said above, one day my Fedora system would not boot, with error messages about the system trying to hook up the first two disks (sda & sdb) at different SATA speeds and continuously failing. With lots of patience I managed to boot Slackware, root-only since its /home partition on sda4 wasn't available. From there, the first thing I did was copy everything to an external USB HDD; I certainly planned to have a serious look at this hardware, but first, BACKUP!
If you have ever been to the AnAnA repair shop, you'd understand me: the experience of looking at their faces when you explain to them, over a faulty PSU unit that needs to be repaired, that yes, your drives contain critical data from work, isn't pleasant at best. Their fine print says they take no responsibility anyway.
So, still, +1 for Linux: I was able to recover the data, it was "safe" on the USB drive, and I could start to tackle the issue. I even still had a functional computer with an OS on it, crippled by 66% though it was.
First, let's do it the Caveman Way, myself, since that's all the AnAnA boyz are going to do anyway: split the case open, un-wire everything inside, clean all connectors both on boards and on wires, and put everything back, firmly and tightly, in a different order. (I am not using UUIDs in GRUB or fstab, so it's still "in the same order"; I only switched the wires and changed the power output feeding the disk drives.)
Reboot... works fine, no error messages, no hangs on the SATA connection... ...Wait!
Things seemed fine, until I noticed my two sdb1/sdc1 drives were out of sync, since I had "touched" the last one while fiddling in Slackware. Auto-recovery or auto-resync doesn't seem to be on the menu, so my system shamelessly started a RAID array on... one drive. I understood later: to speed up the backup, I had deleted some old, redundant backups that were eating a lot of space, so the out-of-sync state is, well, quite normal.
Keep cool, man, you can't kill the mdadm dev from 8,000 miles away anyway.
Repair process, lessons learned & other bits of CLI stuff
HOW would you notice you are running a half-broken RAID array? At startup time, the system will print something like:

mdadm: Device /dev/md0 started with 1 (out of 2) drives
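If the boot message scrolled past too fast, the same condition shows up in /proc/mdstat: a missing member appears as an underscore in the [UU] status field. A minimal sketch of checking for it, run here against a sample string rather than the live file:

```shell
# Degraded-array check: "_" in the [UU] field means a missing member.
# The sample text stands in for the live /proc/mdstat; on a real box:
#   grep -q '\[[U_]*_[U_]*\]' /proc/mdstat && echo degraded
mdstat='md0 : active raid1 sdb1
      104856128 blocks [2/1] [_U]'

state=healthy
printf '%s\n' "$mdstat" | grep -q '\[[U_]*_[U_]*\]' && state=degraded
echo "array state: $state"
```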
If your /etc/mdadm.conf file says "partitions" rather than a specific list of drives, then mdadm reads /proc/partitions:
[root@DC266 etc]# less /proc/partitions
major minor  #blocks    name
   8    16   156290904  sdb
   8    20    31455270  sdb4
   8    32   156290904  sdc
   8    36    31455270  sdc4
   9     0   104856128  md0
fdisk -l
>> OK, everybody is listed, filesystems match
   Device Boot   Start     End      Blocks    Id  System
/dev/sdb1            1   13054  104856223+   fd  Linux raid autodetect
/dev/sdc1   *        1   13054  104856223+   fd  Linux raid autodetect

Disk /dev/md0: 107.3 GB, 107372675072 bytes
2 heads, 4 sectors/track, 26214032 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md0 doesn't contain a valid partition table
>> That's normal, yep it is!
[root@DC266 ~]# mdadm --detail --scan /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sun Sep 30 06:14:31 2007
     Raid Level : raid1
     Array Size : 104856128 (100.00 GiB 107.37 GB)
  Used Dev Size : 104856128 (100.00 GiB 107.37 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Tue Oct 14 09:54:22 2008
          State : clean, degraded     >> One drive is missing
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0                   << but it's nowhere to be seen!
  Spare Devices : 0
           UUID : 1a654dbc:e06168d5:e400185b:ea8dcee9
         Events : 0.5894

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1
>> We should see /dev/sdc1 here
[root@DC266 ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1
      104856128 blocks [2/1] [_U]
unused devices: <none>
[root@DC266 ~]# less /etc/mdadm.conf
# mdadm.conf written out by anaconda
DEVICE /dev/sd[ab]1
>> That was a big f**cking mistypo; when rectified, it changed nothing.
MAILADDR root
ARRAY /dev/md0 level=raid1 num-devices=2 devices=/dev/sdb1,/dev/sdc1
IF you happened to modify things on one drive during a recovery attempt, then unmount the drives and do:
dd if=/source of=/target
In real-world terms, I did:
dd if=/dev/sdc1 of=/dev/sdb1
to get the most recent data (the copy I had been able to access and clean up for backup) onto the out-of-sync drive, sdb1, before attempting to re-create the RAID1 array.
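The same dd clone, sketched on throwaway files so it can be tried safely; the file names are stand-ins I made up for the two partitions, and on the real drives both partitions must be unmounted first:

```shell
# dd-clone sketch on plain files (source.img stands in for /dev/sdc1,
# target.img for /dev/sdb1). On real partitions, unmount both first and
# double-check if= and of= - swapping them destroys the good copy.
printf 'most recent critical data' > source.img
printf 'stale out-of-sync bits'    > target.img

dd if=source.img of=target.img bs=4M 2>/dev/null
cmp -s source.img target.img && echo "copies now identical"
```

A larger bs= than the 512-byte default makes a whole-partition copy considerably faster.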
The following line is what to do from a 'blank' state, i.e. from a system which has never touched/created the array:
[root@DC266 ~]# mdadm --assemble --verbose /dev/md0 /dev/sdb1 /dev/sdc1
mdadm: device /dev/md0 already active - cannot assemble it
(Seeing this afterwards makes me think I should first have stopped the one-armed array with mdadm -S /dev/md0.)
The line above works on an undamaged partition or a fresh Linux install, where the two drives are recognized as a RAID partition but were not mounted/assembled at boot - I use this command to get my /dev/md0 back after new installs.
EDIT: the following concerns an array that disappeared after, for instance, fooling around with gparted (10/03/2009, Slack 12.2 on DC266). The actual data on disc is good, the drives are in sync, but for whatever fucking reason you have to rebuild the array manually, because what changed was not its components or their locations but, in this very case, its block size:
OK then: delete all lines in /etc/mdadm.conf and reboot; you "don't" have a RAID array anymore. Follow this with:
[root@DC266 ~]# mdadm --create -l raid1 /dev/md0 -n 2 /dev/sdb1 /dev/sdc1
With Fedora 7, it gave me this:
mdadm: /dev/sdb1 appears to contain an ext2fs file system
    size=104856128K  mtime=Tue Oct 14 13:50:38 2008
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid1 devices=2 ctime=Sun Sep 30 06:14:31 2007
mdadm: /dev/sdc1 appears to contain an ext2fs file system
    size=104856128K  mtime=Tue Oct 14 09:42:29 2008
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid1 devices=2 ctime=Sun Sep 30 06:14:31 2007
Continue creating array? yes
mdadm: array /dev/md0 started.
[root@DC266 ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1 sdb1
      104856128 blocks [2/2] [UU]
      [==========>..........]  resync = 50.0% (52502912/104856128) finish=40.9min speed=21324K/sec
unused devices: <none>
With Slackware 12.2, it gave me this:
root@DC266:/# mdadm --create -l raid1 /dev/md0 -n 2 /dev/sdb1 /dev/sdc1
mdadm: /dev/sdb1 appears to contain an ext2fs file system
    size=155236060K  mtime=Tue Mar 10 13:01:19 2009
mdadm: /dev/sdc1 appears to contain an ext2fs file system
    size=155236060K  mtime=Mon Mar  9 18:05:19 2009
Continue creating array? y
mdadm: array /dev/md0 started.
I think this is because gparted rewrote the partition table or whatever; you'll notice that this time it doesn't output the bit about the partitions being part of a raid array. I have noticed in the past that md devices either don't appear in gparted, or appear but cannot be handled by it. Don't ask; libparted is the same tool that lets you build raid arrays during installs anyway.
root@DC266:/# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : active raid1 sdc1 sdb1
      155235968 blocks [2/2] [UU]
      [====>................]  resync = 24.5% (38177728/155235968) finish=50.2min speed=38782K/sec
unused devices: <none>
Better to have a backup of mdadm.conf... you never know... Warning: resync is a very slow process, and very much so if you have big drives! Be patient!
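While waiting, the resync percentage can be pulled out of /proc/mdstat with a one-liner. A sketch using the progress line quoted above as a sample (on a live system, read the real file, and wrap the command in watch to follow it):

```shell
# Extract the resync percentage from an mdstat progress line with sed.
# The sample is copied from the output above; the live version is:
#   sed -n 's/.*resync = \([0-9.]*\)%.*/\1/p' /proc/mdstat
line='      [==========>..........]  resync = 50.0% (52502912/104856128) finish=40.9min speed=21324K/sec'
pct=$(printf '%s\n' "$line" | sed -n 's/.*resync = \([0-9.]*\)%.*/\1/p')
echo "resync at ${pct}%"
```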
Failure again: the boot-up sequence broke, mdadm failed, and the system jumped to recovery mode because of a bad superblock or whatever on the raid array.
I used Slackware to get to the / partition of Fedora and removed the relevant line in /etc/fstab;
rebooted Fedora, then killed ALL mdmonitor entries in rc.d, rc1.d and so on (find them with locate mdadm and locate mdmonitor);
and re-built my own mdadm.conf file in classic stripped-down style, as follows (the UUID-based version failed):
# mdadm.conf in manual stripped-down recovery version
DEVICE /dev/sd[bc]1
MAILADDR root
ARRAY /dev/md0 level=raid1 num-devices=2 devices=/dev/sdb1,/dev/sdc1
Then I rebuilt the array again, first stopping it with:
mdadm -S /dev/md0
followed by the instruction to assemble the RAID according to the default config file:
mdadm -As /dev/md0 << -A stands for Assemble; -s for --scan, i.e. pull the array definitions from mdadm.conf.
Reboot: despite my having killed all the mdmonitor files, the built-in kernel support for software raid kicks in anyway, but this time I made it through and was greeted with /dev/md0 started with 2 drives - but wait, this is all with the drive still out of the fstab file; one more reboot is needed.
Fingers crossed. Why does that procedure create an md0 device at the root of my drive??
Time to "defeat" the issue: 2+ full days (down to 30 minutes this last time...) - Data lost: none - Fear factor: DataLoss-threat trouser state officially raised from "damp" to "brown" - system is currently 80% operational. Until further notice, that is.
Terminal Issue // RANT MODE
Rebooting the system as explained, I faced a complete crash, with a system hang on a failure to load/build/whatever /dev/md0. The system then drops into recovery mode, where the / directory is mounted read-only, so you can't get the /dev/md0 line out of the fstab file. That's so fuckingly brilliant that I seriously think of either buying a Vista license or throwing all the computers away and getting MacBook Pros instead.
While recovering (again) with Slackware, I noticed I have a PERFECTLY HEALTHY RAID ARRAY there: /dev/md0 mounts without failure.
So on top of it all, it's not even a "hard" failure; it's actually Fedora that's getting in the way.
I've got to try:

mount -n -o remount,rw /
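Once / is writable again, the offending fstab entry is better commented out than deleted, so it is easy to restore once md0 behaves. A sketch on a throwaway copy (fstab.test is a made-up scratch file; in recovery mode you would run the remount above, then edit /etc/fstab itself):

```shell
# Comment out the /dev/md0 line in a scratch copy of fstab with sed.
cat > fstab.test <<'EOF'
/dev/sda3   /       ext3   defaults   1 1
/dev/md0    /data   ext3   defaults   1 2
EOF
sed -i 's|^/dev/md0|#&|' fstab.test   # "#&" = "#" plus the matched text
grep '/dev/md0' fstab.test
```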
- mdadm seems to lack auto-recovery and auto-resync features,
- if you have trouble, the individual members of the array can be mounted manually, even more easily if you cleared the /etc/mdadm.conf file,
- after recovering from a failed-array situation, the likes of --assemble didn't work here, hence the Caveman Way:
Delete the previous array, start a new one
- Empty mdadm.conf (as of now, mine is rebuilt in the "classic" version as explained earlier),
- e2fsck all member partitions of your array with, as root, e2fsck /dev/sdXn (while they are unmounted),
- mdadm --create -l raid1 /dev/md0 -n 2 /dev/sdb1 /dev/sdc1 (modify to suit your needs)
Details:
- mdadm is the invoked software
- --create does just that: it creates the array without touching the data on the members, then does a re-sync
- -l stands for Level, the type of raid array you want to create
- raid1 is the type, in this case a perfect mirror of two discs
- /dev/md0 is the name of the created special raid device that you will (hopefully) be able to mount later
- -n 2 is the Number of drives (or partitions) involved
- /dev/sdb1 /dev/sdc1 are my selected raid members
- That's all, folks; now wait for the discs to re-sync, and check afterwards whether anything at all still resides on these partitions...
- If needed, you can do
mdadm -S /dev/md0
mdadm -As /dev/md0
provided you have an mdadm.conf file that is functional.
(EDIT: DATA UNTOUCHED, IT WORKS _HERE_ - but can you get it back into fstab, or will it fail again?)
[root@DC266 ~]# mdadm --detail --scan /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Tue Oct 14 14:30:24 2008
     Raid Level : raid1
     Array Size : 104856128 (100.00 GiB 107.37 GB)
  Used Dev Size : 104856128 (100.00 GiB 107.37 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Tue Oct 14 14:30:24 2008
          State : clean, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
 Rebuild Status : 16% complete     << It's re-sync'ing! Hurrah! Be patient!
           UUID : 8cfb903f:7c1fa74b:c7d50d4b:b66cff8a
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
Then don't forget to mount it manually, and even re-create it when needed.
Room for improvement
Get one fucking straight repair method.