Subject: Re: 3rd Master Hard Disk Error
Date: Thu, 22 Nov 2018 15:10:32 -0800
Anyway one of those "fix" stories I have from earlier this year is pretty
interesting - may pass that along, ... but alas, that was work goop, so
would need to redact at least some (notably identifying) bits of it (but
most is sufficiently generic it doesn't matter).
So ... bit of background.
An f5 "appliance" - this particular case and "appliance", the f5 ...
actually a pair of them - one specifically addressed here - the nominal
standby), and the other the active primary. They're Linux based devices,
but ... "appliance" - most of that Linux stuff relatively "hidden",
behind-the-scenes, but ... actually accessible. And, "of course",
all of f5's stuff (software, etc.) layered atop that.
In this particular case, non-production ("lab"), and suffering some
neglect (priorities/resources) and failure(s), we're in situation where
we've got nominally pair of mirrored disks on the unit ... but one has
failed (actually quite a while earlier), and has been replaced (so
effectively no data on the replacement drive), but the other is also
giving hard failures(!). With the support from f5, we'd gotten to the
point where they're basically, "Yeah, you need to rebuild that from
scratch/backups - can't access/recover the data on there."
Myself, of course, knowing it's all atop Linux, I decide to dig a bit
and see if I can recover/fix it without too much difficulty, and avoid
the hassle/complexity (fair number of manual steps and time) to rebuild
and restore from backup(s), etc.
Also, Linux *based* ... but lots of f5 stuff atop that, so ... one can't
presume *too* much about it. E.g. *reading* data is generally fine,
but changing stuff can be (potentially very) hazardous - e.g. if one
changes thing(s) at Linux layer that f5 would expect to be changed
through the f5 interfaces, chaos may ensue - as the f5
software/configurations wouldn't know about or be expecting those
changes. So, one needs be quite mindful of that and sufficiently
careful. Anyway, notwithstanding that major caveat, one can learn, at
the Linux layers, much of what the "appliance" is doing, how, some of
where it is/isn't healthy, etc. And, in some cases, potentially -
carefully - fix/change *some* things. Also, this being the nominal
standby unit, rather than active, we've got much more flexibility to
bring things down, etc. - so that makes it quite a bit easier to deal
with.
Also, these particular f5 "appliances" - they're physical x86 based
hardware units. So there's a physical host at that level. They *also*
support virtualization, so there are *also* "guest" Virtual Machines
(VMs) running atop the hardware - again Linux-based, with f5 stuff atop
that - on both the host running direct on physical, and the "guest" VMs
within - which in most regards look very highly similar to the host
on physical - including all the f5 bits within (just doesn't have yet
another VM layer down underneath the guests - so no guests of guests).
So, again, the context - f5 - failed/failing hard drive (spinning rust),
nonrecoverable error(s), but otherwise (mostly) operational (at
Operating System (OS) level), etc. Two hard drives, nominally mirrored,
the "good" drive is effectively without data - it's a replacement that
was earlier installed for other disk that had much earlier failed
and was nowhere near current on the mirroring anyway. So at this
point essentially one good drive without data, and one problematic
drive (throwing hard errors).
Also, slight aside - folks will reasonably "disagree" / have differing
perspectives on complexity and it's disadvantages and (sometimes)
advantages (what exactly was gained by bringing in that complexity?).
The example here has much complexity (partitions, mdraid, LVM, virtual
machines, f5 vendor layer, and also most all that complexity also within
VM guests) - and that complexity has both disadvantages (much to know /
dig through), and also advantages too (e.g. able to "drill down",
isolate & correct problem - and problem impact also quite limited in
scope due to its isolation quite far down in the layers).
Without further introduction, and slightly redacted:
From: Michael Paoli [REDACTED]
Date: Mon, Aug 6, 2018 at 11:48 AM
Subject: Re: [E] RE: Regarding Service Request #[REDACTED] | | Follow
up from RMA [REDACTED] & [REDACTED]
To: <[REDACTED]@f5.com>
Cc: <[REDACTED]@f5.com>
Poked at it a bit over the weekend - remotely ...
was able to get the data recovered & all remirrored okay, without
rebuild/reinstall.
We'll still probably want to go ahead and replace the disk that had
the read failure, but at present I'm moving on to getting the
[REDACTED] unit properly remirrored, then will circle back to
replacing the quite suspect hardware (disk(s)) that still ought be
replaced ... even if they may be working at the moment (notably ones
that had hard failures earlier).
The short, and longer versions of the recovery without rebuild, in
case you're interested/curious:
basically managed to recover the disk that was giving unrecoverable read
errors, and get it remirrored onto replacement drive - using lower level
Linux bits (not fully sanctioned f5 approach, but fully doable ... so
long as we're sufficiently careful as to not conflict with any f5
specific bits).
[REDACTED]
unrecoverable read errors seen on:
/dev/md16 /shared/vmdisks
/dev/md16 835G 14G 780G 2% /shared/vmdisks
e.g.:
md/raid1:md16: dm-2: unrecoverable I/O read error for block 12812032
sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 35690808
on /shared/vmdisks filesystem, very few files and directories,
no directory read errors,
file read error only on:
[REDACTED].img
all other files read fine
*.img files used by guests (as shown by VM PID(s) having them open as
seen via fuser)
disabled guests until no PIDs had those *.img files open
can't copy full [REDACTED].img due to read errors
/shared/vmdisks filesystem is ext3 so journaling is used (overwrite of
file may be rewritten to different blocks rather than in place, and
probably is) can we read the *data* within [REDACTED].img (at least that's
actually used?)
# losetup -f [REDACTED].img
# losetup -a
/dev/loop0: [0910]:6436 ([REDACTED].img)
is image partitioned?
don't have partx ....
# sfdisk -uS -l /dev/loop0
Disk /dev/loop0: cannot get geometry
Disk /dev/loop0: 13054 cylinders, 255 heads, 63 sectors/track
Warning: The partition table looks like it was made
for C/H/S=*/16/63 (instead of 13054/255/63).
For this listing I'll assume that geometry.
Units = sectors of 512 bytes, counting from 0
Device Boot Start End #sectors Id System
/dev/loop0p1 * 1 460655 460655 b W95 FAT32
/dev/loop0p2 460656 2557295 2096640 82 Linux swap / Solaris
/dev/loop0p3 2557296 209713391 207156096 8e Linux LVM
/dev/loop0p4 0 - 0 0 Empty
#
Yes it's partitioned ...
What data within do we care about?
boot area +MBR (before start of earliest partition)
data within filesystems (not counting slack space) but also including
early space within filesystem (possibly boot data or similar)
don't care about slack space within
Which can/can't we read (and thus copy) and which are or may be
problematic?
Let's try straight copy, see how far we get (expecting it to fail)
# cp -p [REDACTED].img [REDACTED].img.COPY || ls -l [REDACTED].img*
cp: reading `[REDACTED].img': Input/output error
-rw-r--r-- 1 root root 107374182400 Aug 3 16:31 [REDACTED].img
-rw------- 1 root root 49137192960 Aug 3 16:51 [REDACTED].img.COPY
#
That's almost half - should have up through first 2 partitions, and fair
bit of 3rd, but not complete
So, what we're missing in terms of actual data, is (or may at least
include) stuff within 3rd partition ... but that's LVM ... what do we
actually have in there (and how much used and not?)
let's change our existing loop to ro:
# losetup -d /dev/loop0 && losetup -r -f [REDACTED].img; losetup -a
/dev/loop0: [0910]:6436 ([REDACTED].img)
#
The losetup we have doesn't have --sizelimit, but it does have
-o (offset)
# expr 2557296 \* 512
1309335552
# losetup -d /dev/loop1
# losetup -r -f -o 1309335552 [REDACTED].img && losetup -a
/dev/loop0: [0910]:6436 ([REDACTED].img)
/dev/loop1: [0910]:6436 ([REDACTED].img), offset 1309335552
#
# pvscan -u
...
PV /dev/loop1 with UUID wZSZUl-LrU9-kBIg-2EkM-G4wg-4nqt-J6FYVI VG
vg-db-vda lvm2 [98.78 GB / 51.91 GB free]
...
Here's a very interesting bit:
# ls -alsh [REDACTED].img
4.6G -rw-r--r-- 1 root root 100G Aug 3 16:31 [REDACTED].img
So ... (very) sparse file - mostly unallocated blocks.
GNU cp has an option to be highly efficient about sparse copies ...
but ... we wouldn't get a read error from a non-allocated block,
so, somewhere in the ~4.6G are read error(s)
VG name conflicts with existing so ...
# vgimportclone -n vg-db-vda.[REDACTED] -i /dev/loop1
WARNING: Activation disabled. No device-mapper interaction will be
attempted.
/tmp/snap.iWI12298/vgimport0: write failed after 0 of 4096 at 4096:
Operation not permitted
pv_write with new uuid failed for /tmp/snap.iWI12298/vgimport0.
0 physical volumes changed / 1 physical volume not changed
Fatal: Unable to change PV uuid for /tmp/snap.iWI12298/vgimport0, error: 5
#
but need to be rw to import ...
# losetup -d /dev/loop1 && losetup -f -o 1309335552 [REDACTED].img &&
losetup -a
/dev/loop0: [0910]:6436 ([REDACTED].img)
/dev/loop1: [0910]:6436 ([REDACTED].img), offset 1309335552
#
# vgimportclone -n vg-db-vda.[REDACTED] -i /dev/loop1
WARNING: Activation disabled. No device-mapper interaction will be
attempted.
Physical volume "/tmp/snap.BnJ12488/vgimport0" changed
1 physical volume changed / 0 physical volumes not changed
WARNING: Activation disabled. No device-mapper interaction will be
attempted.
Volume group "vg-db-vda" successfully changed
Volume group "vg-db-vda" successfully renamed to "vg-db-vda.[REDACTED]"
Reading all physical volumes. This may take a while...
Found volume group "vg-db-sdb" using metadata type lvm2
Found volume group "vg-db-sda" using metadata type lvm2
Found volume group "vg-db-vda.[REDACTED]" using metadata type lvm2
Found volume group "vg-db-cpmirror" using metadata type lvm2
# vgchange -a y "vg-db-vda.[REDACTED]"
16 logical volume(s) in volume group "vg-db-vda.[REDACTED]" now active
# lvs | fgrep vg-db-vda.[REDACTED] | sed -e 's/[ ]*$//'
dat.log.1 vg-db-vda.[REDACTED] -wi-a- 7.00G
dat.maint.1 vg-db-vda.[REDACTED] -wi-a- 300.00M
dat.share.1 vg-db-vda.[REDACTED] -wi-a- 20.00G
dat.swapvol.1 vg-db-vda.[REDACTED] -wi-a- 1.00G
set.1._config vg-db-vda.[REDACTED] -wi-a- 3.00G
set.1._usr vg-db-vda.[REDACTED] -wi-a- 2.20G
set.1._var vg-db-vda.[REDACTED] -wi-a- 3.00G
set.1.root vg-db-vda.[REDACTED] -wi-a- 392.00M
set.2._config vg-db-vda.[REDACTED] -wi-a- 512.00M
set.2._usr vg-db-vda.[REDACTED] -wi-a- 1.25G
set.2._var vg-db-vda.[REDACTED] -wi-a- 3.00G
set.2.root vg-db-vda.[REDACTED] -wi-a- 256.00M
set.3._config vg-db-vda.[REDACTED] -wi-a- 512.00M
set.3._usr vg-db-vda.[REDACTED] -wi-a- 1.25G
set.3._var vg-db-vda.[REDACTED] -wi-a- 3.00G
set.3.root vg-db-vda.[REDACTED] -wi-a- 256.00M
# vgdisplay vg-db-vda.[REDACTED]
--- Volume group ---
VG Name vg-db-vda.[REDACTED]
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 28
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 16
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size 98.78 GB
PE Size 4.00 MB
Total PE 25287
Alloc PE / Size 11999 / 46.87 GB
Free PE / Size 13288 / 51.91 GB
VG UUID Pr3x0P-li8m-HUAR-K5w4-v7Y0-vQjL-YyABNk
#
Note that over half the VG is unallocated blocks
Which can we recover?
# mkdir recover
# (for f in /dev/mapper/vg--db--vda.[REDACTED]*; do b=`basename "$f"`
&& cat "$f" > recover/"$b"; done)
cat: /dev/mapper/vg--db--vda.[REDACTED]-set.1._config# : Input/output error
All LVs read fine, except the one immediately above
# blkid /dev/mapper/vg--db--vda.[REDACTED]-set.1._config
/dev/mapper/vg--db--vda.[REDACTED]-set.1._config:
LABEL="set.1./config" UUID="ba980429-7a35-46c5-925b-9f4eadcc4ba0"
SEC_TYPE="ext2" TYPE="ext3"
# mkdir [REDACTED].config && mount -o ro,nosuid,nodev
/dev/mapper/vg--db--vda.[REDACTED]-set.1._config [REDACTED].config
# df -h [REDACTED].config
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg--db--vda.[REDACTED]-set.1._config
3.0G 72M 2.8G 3% /shared/vmdisks/[REDACTED].config
#
Filesytem only 3% used anyway ...
# find [REDACTED].config ! -type d ! -type f ! -type l -print
Only files, directories, and sym links ... can we back them all up?
Yes, all read and backed up without error:
(cd [REDACTED].config && tar -cf - .) | gzip -9 > [REDACTED].config.tar.gz
What about metadata?
# blkid /dev/mapper/vg--db--vda.[REDACTED]-set.1._config
/dev/mapper/vg--db--vda.[REDACTED]-set.1._config:
LABEL="set.1./config" UUID="ba980429-7a35-46c5-925b-9f4eadcc4ba0"
SEC_TYPE="ext2" TYPE="ext3"
# umount [REDACTED].config && rmdir [REDACTED].config
# sfdisk -uS -d /dev/loop0
# partition table of /dev/loop0
unit: sectors
/dev/loop0p1 : start= 1, size= 460655, Id= b, bootable
/dev/loop0p2 : start= 460656, size= 2096640, Id=82
/dev/loop0p3 : start= 2557296, size=207156096, Id=8e
/dev/loop0p4 : start= 0, size= 0, Id= 0
# vgchange -a n "vg-db-vda.[REDACTED]"
# vgdisplay vg-db-vda.[REDACTED]
VG UUID Pr3x0P-li8m-HUAR-K5w4-v7Y0-vQjL-YyABNk
# vgrename Pr3x0P-li8m-HUAR-K5w4-v7Y0-vQjL-YyABNk vg-db-vda
Volume group "vg-db-vda.[REDACTED]" successfully renamed to "vg-db-vda"
# mv [REDACTED].config.tar.gz [REDACTED].img.COPY recover/
# rm recover/vg--db--vda.[REDACTED]-set.1._config
#
# ssh -ax -l root [REDACTED] 'cd /shared/vmdisks && { df -h .;
/bin/hostname; ls -d rec*; }'
Password:
Filesystem Size Used Avail Use% Mounted on
/dev/md16 835G 14G 780G 2% /shared/vmdisks
[REDACTED]
ls: rec*: No such file or directory
# (cd /shared/vmdisks/recover && tar -cf - .) | gzip -9 >
../[REDACTED].recover.tar.gz
# scp -p ../[REDACTED].recover.tar.gz
root@[REDACTED]:/shared/vmdisks/recover.[REDACTED]/
Password:
[REDACTED].recover.tar.gz 100% 1718MB 63.6MB/s 00:27
# cd /shared/vmdisks
# scp -p $(find [A-Z]* -name recover -type d -prune -o \( -type f !
-name [REDACTED].img \) -print)
root@[REDACTED]:/shared/vmdisks/recover.[REDACTED]/
Password:
[REDACTED]_GUEST.img 100% 100GB
65.8MB/s 25:56
[REDACTED]_GUEST.info 100% 23
0.0KB/s 00:00
[REDACTED].info 100% 23
0.0KB/s 00:00
[REDACTED]_GUEST.img 100% 100GB
65.4MB/s 26:06
[REDACTED]_GUEST.info 100% 23
0.0KB/s 00:00
#
# mkdir recover.2
# cp -p [REDACTED].img recover.2/[REDACTED].img.COPY2
# echo $?
#
Note that it actually copied without error that time,
drive likely managed to successfully remap the sector it couldn't read.
# smartctl -x /dev/sda 2>&1 | sed -ne '/^ID#/,/prefailure/p'
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 197 197 051 - 302
3 Spin_Up_Time POS--K 198 198 021 - 3100
4 Start_Stop_Count -O--CK 100 100 000 - 46
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 1
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 093 093 000 - 5519
10 Spin_Retry_Count -O--CK 100 253 000 - 0
11 Calibration_Retry_Count -O--CK 100 253 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 44
192 Power-Off_Retract_Count -O--CK 200 200 000 - 36
193 Load_Cycle_Count -O--CK 200 200 000 - 20
194 Temperature_Celsius -O---K 124 100 000 - 26
196 Reallocated_Event_Count -O--CK 199 199 000 - 1
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 100 253 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 100 253 000 - 0
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
#
Notice Current_Pending_Sector is now down to 0 - whereas earlier it was 1,
so the bad sector has been mapped out now (either recovered and
remapped, or remapped when written to). So seems we're good now.
# already copied all this stuff to other host, so:
# rm [REDACTED].recover.tar.gz && rm -rf recover recover.2
#
# mdadm --detail /dev/md16 | less
/dev/md16:
Version : 0.90
Creation Time : Tue Jun 17 22:25:58 2014
Raid Level : raid1
Array Size : 877072320 (836.44 GiB 898.12 GB)
Used Dev Size : 877072320 (836.44 GiB 898.12 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 16
Persistence : Superblock is persistent
Update Time : Sat Aug 4 18:39:59 2018
State : active, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
UUID : 165d54b9:8605b578:10b78e4a:3daa69b7
Events : 0.2291861
Number Major Minor RaidDevice State
0 253 2 0 active sync
/dev/vg-db-sda/mdm.app.vcmp.dat.vmdisks
1 0 0 1 removed
2 253 22 - spare
/dev/vg-db-sdb/mdm.app.vcmp.dat.vmdisks
#
# mdadm --remove /dev/md16 /dev/vg-db-sdb/mdm.app.vcmp.dat.vmdisks
mdadm: hot removed /dev/vg-db-sdb/mdm.app.vcmp.dat.vmdisks
# mdadm --zero-superblock /dev/vg-db-sdb/mdm.app.vcmp.dat.vmdisks
# mdadm --add /dev/md16 /dev/vg-db-sdb/mdm.app.vcmp.dat.vmdisks
mdadm: added /dev/vg-db-sdb/mdm.app.vcmp.dat.vmdisks
#
Did the remirror of md16 - it completed fine this time, and once it
completed, all looks fine on the drives.
While that was going on (actually started bit earlier), also did:
# dd if=/dev/zero of=.nulls obs=4096; rm .nulls
dd: writing to `.nulls': No space left on device
1719802064+0 records in
214975257+0 records out
880538652672 bytes (881 GB) copied, 7375.2 seconds, 119 MB/s
#
That was to write out most blocks on the filesystem - so any that might be
marginal would get remapped ... and the mirror completed fine, so all
we read okay.
Also ran [REDACTED] through a full reboot to ensure everything came up
and looked fine - that was also having reenabled the guests after all
the data was reading fine again.
Anyway, may not have shown all the steps in the preceding/above, but
showed at least key steps and data points.