
Wikimedia Cloud Services team/EnhancementProposals/SpareDisks

Problem Statement

WMCS servers currently do not have hot spare disks added to their RAID configuration.

We depend on Dell's 24-hour support contract to deliver new replacements to us.

Since we standardize on RAID-10, a single disk failure leaves the affected mirror pair without redundancy: the whole volume then depends on the surviving disk of that pair continuing to work until a replacement arrives, is installed, and finishes rebuilding. In practice, this means we could be running without redundancy for several days.

Because RAID-10 mirrors the same writes to both disks in a pair, the SSDs in a pair accumulate nearly identical wear. When one fails due to wear, its partner is likely close to failure as well, so the surviving disk could also fail at any moment.

Miscommunication with Dell support or shipping problems can add further delays. We also don't have 24x7 DC-Ops staff who could start working on a disk replacement immediately, and organizational challenges in monitoring/alerting could add even more delay.

Proposal

Adopt a new RAID standard in which 2 disks are configured as hot spares. A spare becomes active immediately after the RAID controller detects a failure, giving all teams more time to react and reducing the window during which a RAID volume runs without redundancy.

In essence, we trade storage capacity for a faster mean time to recovery (MTTR).

Technical Impact

Here is the current situation for our various servers:

Server | Vendor | Disks | Type | Raw Capacity | Current RAID-10 | RAID-10 w/ spares | Variation | Current Disk Usage
labvirt1001 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 1.2TB
labvirt1002 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 1.2TB
labvirt1003 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 0.9TB
labvirt1004 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 1.2TB
labvirt1005 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 0.9TB
labvirt1006 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 0.7TB
labvirt1007 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 1.3TB
labvirt1008 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 0.5TB
cloudvirt1009 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | -
cloudvirt1012 | HP | 6 | SSD 1.6TB | 9.6TB | 4.8TB | 3.2TB | -33% | -
cloudvirt1013 | HP | 6 | SSD 1.6TB | 9.6TB | 4.8TB | 3.2TB | -33% | 0.3TB
cloudvirt1014 | HP | 6 | SSD 1.6TB | 9.6TB | 4.8TB | 3.2TB | -33% | 0.5TB
cloudvirt1015 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | -
cloudvirt1016 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 2.1TB
cloudvirt1017 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 1.5TB
cloudvirt1018 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 1.1TB
cloudvirt1019 | HP | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 1.5TB
cloudvirt1020 | HP | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | -
cloudvirt1021 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 1.3TB
cloudvirt1022 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 2.8TB
cloudvirt1023 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 2.4TB
cloudvirt1024 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 0.2TB
cloudvirt1025 | Dell | 6 | SSD 1.8TB | 10.8TB | 5.4TB | 3.6TB | -33% | 0.5TB
cloudvirt1026 | Dell | 6 | SSD 1.8TB | 10.8TB | 5.4TB | 3.6TB | -33% | 1.1TB
cloudvirt1027 | Dell | 6 | SSD 1.8TB | 10.8TB | 5.4TB | 3.6TB | -33% | 1TB
cloudvirt1028 | Dell | 6 | SSD 1.8TB | 10.8TB | 5.4TB | 3.6TB | -33% | 1TB
cloudvirt1029 | Dell | 6 | SSD 1.8TB | 10.8TB | 5.4TB | 3.6TB | -33% | 0.5TB
cloudvirt1030 | Dell | 6 | SSD 1.8TB | 10.8TB | 5.4TB | 3.6TB | -33% | 1.5TB
cloudvirtan1001 | Dell | 12 | 7.2k 4TB | 48TB | 24TB | 20TB | -17% | -
cloudvirtan1002 | Dell | 12 | 7.2k 4TB | 48TB | 24TB | 20TB | -17% | -
cloudvirtan1003 | Dell | 12 | 7.2k 4TB | 48TB | 24TB | 20TB | -17% | -
cloudvirtan1004 | Dell | 12 | 7.2k 4TB | 48TB | 24TB | 20TB | -17% | -
cloudvirtan1005 | Dell | 12 | 7.2k 4TB | 48TB | 24TB | 20TB | -17% | -
labstore1004 | Dell | 26 | 7.2k 2TB | 52TB | 26TB | 24TB | -8% | -
labstore1005 | Dell | 26 | 7.2k 2TB | 52TB | 26TB | 24TB | -8% | -
labstore1006 | HP | 12 | 7.2k 6TB | 72TB | 36TB | 30TB | -17% | -
labstore1007 | HP | 12 | 7.2k 6TB | 72TB | 36TB | 30TB | -17% | -
cloudstore1008 | Dell | 12 | 7.2k 6TB | 72TB | 36TB | 30TB | -17% | -
cloudstore1009 | Dell | 12 | 7.2k 6TB | 72TB | 36TB | 30TB | -17% | -

Important notes:

  • Nominal disk capacities are used (the base-10 sizes published by the vendor); see the calculation sketch after these notes
  • Disks dedicated to the operating system are ignored
  • cloudvirtan* are "owned" by the Analytics team and may or may not be sized to allow for spares
  • labstore100{4,5,6,7} are scheduled to be decommissioned, with instances on cloudvirt10{19,20} as the replacement
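
The Variation column above can be reproduced with a quick back-of-the-envelope calculation. Here is a minimal sketch (shell + awk, assuming the nominal sizes above); the parameters correspond to the 10-disk 1.6TB SSD cloudvirts:

disks=10; size_tb=1.6; spares=2
awk -v d="$disks" -v s="$size_tb" -v sp="$spares" 'BEGIN {
    raw      = d * s              # raw capacity
    current  = raw / 2            # RAID-10 halves the raw capacity
    w_spares = (d - sp) * s / 2   # RAID-10 over the non-spare disks only
    printf "raw=%.1fTB current=%.1fTB with_spares=%.1fTB variation=%.1f%%\n",
           raw, current, w_spares, (w_spares / current - 1) * 100
}'

For the 16-disk 15k hosts the same calculation gives 2.4TB vs 2.1TB (-12.5%), matching the table.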

Timeline

If this proposal is accepted, we could adopt two strategies for deployment:

  • Piggyback on our migration to Stretch/Mitaka on the hypervisors and use each reimage as an opportunity to modify the RAID configuration
  • Schedule downtime for the remaining hypervisors that are already on Stretch/Mitaka so they can be drained, reimaged, and put back into production

Since draining hypervisors is a very disruptive process, the complete implementation of this proposal would have to take into account how much downtime we are comfortable with.

This is expected to be a year-long goal, if not longer.

Voting

Please add more stakeholders as needed. Vote Yes/No and include a justification.

Name | Vote | Comment
Andrew Bogott | - | -
Arturo Borrero | Yes | We should increase robustness and resilience of CloudVPS. I know this involves capacity/budget/refresh planning.
Brooke Storm | - | -
Bryan Davis | Yes | Support for the piggyback strategy + investigating "converged infrastructure" idea of re-purposing cloudvirt local storage as Ceph storage that is exposed back to the cloudvirts for instance storage.
Giovanni Tirloni | Yes | Reasons: Engineer time is more expensive than cost of spare disks. Lack of redundancy is unacceptable in face of data loss (which has already occurred). We cannot maintain any meaningful SLA with humans in the critical path.

Decision

While some team members didn't formally cast their vote in this document, in a meeting held on Feb 12 the majority seemed to agree that adopting hot spare disks was a good strategy.

We will:

  • Reconfigure RAID arrays to add hot spares at every opportunity we have (reimages, moving hypervisors to the new eqiad1 region, etc.)
  • Not drain existing hypervisors solely to reconfigure RAID, because that is too time-consuming a process
  • Configure new servers bought for codfw that only serve dev/test purposes with 1 hot spare disk instead of the 2 used on production hypervisors
  • Investigate using the hypervisors as Ceph nodes themselves in a "converged infrastructure" approach

RAID configuration

The initial RAID configuration uses the last two disks as hot spares. As disks fail and get replaced, the hot spares will end up in different slots over time.
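
To see which slots currently hold the spares on an HP controller, a quick check like the following can help (a minimal sketch; it assumes the hpssacli/ssacli tool is installed and that spares are flagged in the drive listing as in the examples below):

# list all physical drives and keep only the ones flagged as spares
hpssacli ctrl slot=0 pd all show | grep -i 'spare'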

Dell servers

Reboot the server and reconfigure the RAID controller through its UI.

Remember to go into 'Advanced' while creating the volume and select 'Add hot spares' and 'Initialize'.

When selecting the disks for the volume, leave the last 2 disks unselected. Unintuitively, a new window will pop up after the volume is created, asking you to select the spare disks.
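
If reconfiguring from Linux is preferred over the UI, something along these lines should work with Dell's perccli tool. This is only a sketch and has not been verified on these hosts; the controller, enclosure, and slot numbers are placeholders:

# sketch only: /c0, e32 and slots 8-9 are placeholders for the real IDs
perccli64 /c0 show                       # list arrays and physical drives
perccli64 /c0/e32/s8 add hotsparedrive   # mark the second-to-last disk as a hot spare
perccli64 /c0/e32/s9 add hotsparedrive   # mark the last disk as a hot spare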

HP servers (Gen8)

There is an issue when pressing F3 to delete the existing RAID volume. If possible, run the commands below from Linux instead (e.g. via the hpssacli/ssacli interactive shell, which provides the => prompt shown here).

Show current status:

=> ctrl slot=0 pd all show

Smart Array P420i in Slot 0 (Embedded)

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)

   array B

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 300 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 300 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 300 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 300 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 300 GB, OK)
      physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS, 300 GB, OK)
      physicaldrive 2I:1:14 (port 2I:box 1:bay 14, SAS, 300 GB, OK)
      physicaldrive 2I:1:15 (port 2I:box 1:bay 15, SAS, 300 GB, OK)
      physicaldrive 2I:1:16 (port 2I:box 1:bay 16, SAS, 300 GB, OK)
      physicaldrive 2I:1:17 (port 2I:box 1:bay 17, SAS, 300 GB, OK)
      physicaldrive 2I:1:18 (port 2I:box 1:bay 18, SAS, 300 GB, OK)

Delete existing RAID volume:

=> ctrl slot=0 ld 2 delete forced

Warning: Deleting an array can cause other array letters to become renamed.
         E.g. Deleting array A from arrays A,B,C will result in two remaining
         arrays A,B ... not B,C

Create new RAID volume (leaving 2 disks for spares):

=> ctrl slot=0 create type=ld raid=1+0 drives=1I:1:3,1I:1:4,1I:1:5,1I:1:6,1I:1:7,1I:1:8,1I:1:9,1I:1:10,1I:1:11,1I:1:12,1I:1:13,2I:1:14,2I:1:15,2I:1:16

Add hot spares to the array:

=> ctrl slot=0 array B add spares=2I:1:17,2I:1:18

Show RAID volume:

=> ctrl slot=0 ld 2 show          

Smart Array P420i in Slot 0 (Embedded)

   array B

      Logical Drive: 2
         Size: 1.9 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1792 KB
         Status: OK
         Caching:  Enabled
         Unique Identifier: 600508B1001C2FD858B7C664FB32BECD
         Disk Name: /dev/sdb 
         Mount Points: None
         Logical Drive Label: 045EEA760014380311954D026CC
         Mirror Group 1:
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 300 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 300 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 300 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 300 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 300 GB, OK)
            physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS, 300 GB, OK)
            physicaldrive 2I:1:14 (port 2I:box 1:bay 14, SAS, 300 GB, OK)
            physicaldrive 2I:1:15 (port 2I:box 1:bay 15, SAS, 300 GB, OK)
            physicaldrive 2I:1:16 (port 2I:box 1:bay 16, SAS, 300 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache

Show all disks (confirm there are 2 spares):

=> ctrl slot=0 pd all show

Smart Array P420i in Slot 0 (Embedded)

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)

   array B

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 300 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 300 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 300 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 300 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 300 GB, OK)
      physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS, 300 GB, OK)
      physicaldrive 2I:1:14 (port 2I:box 1:bay 14, SAS, 300 GB, OK)
      physicaldrive 2I:1:15 (port 2I:box 1:bay 15, SAS, 300 GB, OK)
      physicaldrive 2I:1:16 (port 2I:box 1:bay 16, SAS, 300 GB, OK)
      physicaldrive 2I:1:17 (port 2I:box 1:bay 17, SAS, 300 GB, OK, spare)
      physicaldrive 2I:1:18 (port 2I:box 1:bay 18, SAS, 300 GB, OK, spare)

HP servers (Gen9)

Enter the HP RAID configuration:

  • Reboot the server
  • Press ESC+9 to enter the menu
  • Select the RAID controller in slot 1
  • Select the option to open the configuration utility
  • Wait for the "error: no such device: HPEZCD260" message to disappear

Verify current situation:

=> controller slot=1 ld 1 show  

Smart Array P840 in Slot 1

   Array A

      Logical Drive: 1
         Size: 7.3 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1280 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Unique Identifier: 600508B1001CB0D3B3EFD3A1715AB007
         Disk Name: /dev/sdd 
         Mount Points: None
         Logical Drive Label: 0110F6E7PDNNF0ARH8015T82C5
         Mirror Group 1:
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SATA SSD, 1.6 TB, OK)
         Mirror Group 2:
            physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA SSD, 1.6 TB, OK)
         Drive Type: Data
         LD Acceleration Method: SSD Smart Path

Delete existing array:

=> ctrl slot=1 array A delete forced

Warning: Deleting the specified device(s) will result in data being lost.
         Continue? (y/n) y


Confirm all disks are now unassigned:

=> ctrl slot=1 pd all show

Smart Array P840 in Slot 1

   Unassigned

      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA SSD, 1.6 TB, OK)

Create a new array / logical drive (leaving two disks to be used as spares later):

=> ctrl slot=1 create type=ld drives=1I:1:5,1I:1:6,1I:1:7,1I:1:8,2I:1:1,2I:1:2,2I:1:3,2I:1:4 raid=1+0 forced

Warning: SSD Over Provisioning Optimization will be performed on the physical
         drives in this array. This process may take a long time and cause this
         application to appear unresponsive. Continue? (y/n)y

Add the last 2 disks as spares:

=> ctrl slot=1 array all add spares=2I:2:1,2I:2:2
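
As a final check, it is worth confirming that the two spares registered, mirroring the verification step from the Gen8 procedure (a quick sketch, run from Linux or from the same interactive shell):

# the two new spares should show up flagged as "spare" in the drive listing
hpssacli ctrl slot=1 pd all show | grep -i 'spare'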

Resources

Add papers, external links, etc., that support this proposal.
