Blog

What's going on with the MGI Storage service lately?

Shortly after the storage maintenance of July 13, 2018, we saw a performance degradation event that plagued us for much of the rest of July. This document describes what we think we know about this event, what we believe the causes are, what we intend to do about it, and what you, the reader, can do to improve your experience.

Executive Summary

  • Stop using Lucid Workstations, or, if you must use one, take steps to avoid touching /gscmnt from it (and ask yourself, "Why must I use a Lucid Workstation?").
  • Use virtual-workstation2,3,4 and 5 if you need /gscuser.
  • Do your work on a Mac, if you have one.
  • Know that SMB access to data is coming soon.
  • To access /gscmnt, first SSH to virtual-workstation2,3,4 or 5 and bsub to an interactive shell.
  • Coming soon, virtual-workstations2,3,4 and 5 will be GPFS clients for /gscmnt access.
  • Stop using the "long" queue or any queue that uses the "blades" hostgroup, move to Docker.
  • Stop creating directories with millions of files.
  • If you have a common "use case" that cannot be accomplished with the above workarounds, let us know via a Service Desk ticket.


Recap

A few days after the July 13 Maintenance, on July 19, RIS discovered evidence of "half-closed" TCP sockets (sockets "stuck" in CLOSE_WAIT); see ITDEV-8793, "nfs_stale_mount_not_responding with lots of CLOSE_WAIT on ces3" (in progress). This correlated with an uptick in Service Desk incident reports of storage latency. We have seen this problem before, most recently in November 2017, with at least two other documented cases before that. In every previously recorded occurrence, the team spent many days debugging only to have the problem disappear on its own.

This time we were perhaps not so lucky, as the problem persisted for days.

There are various storage services across our infrastructure, for reasons both technical and historical. The case we're talking about here is access to /gscmnt data mounts over NFSv3. This specifically is not /gscuser or /gsc (the home-app cluster), nor is it GPFS access via production HPC nodes. The problem is only NFS access to /gscmnt.
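
If you're not sure which protocol a given host is using to reach a /gscmnt volume, a quick check like the following will tell you (the volume name is only an example); the filesystem type column will read nfs/nfs4 for NFS mounts and gpfs for native GPFS mounts:

# Show the filesystem type backing one volume (example volume name):
df -T /gscmnt/gc2000

# Or list every /gscmnt mount with its type:
mount | grep /gscmnt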

Current State

Things are basically functional at this time, but there have been no "fixes" deployed just yet. We have identified some workarounds to help staff remain productive, and we are taking actions to move services off NFS to help mitigate the impact.

Most Likely Root Cause

After many discussions, IBM believes it can describe the "most likely culprit". We cannot say with 100% certainty that the following is true, but we think it is. An aside on this uncertainty...

Our storage cluster is large. We have over 10PB of storage in active use by hundreds of compute nodes. The NFS server process, the Ganesha NFS daemon, must be interrogated in order to find the real root cause.

  • A bug in ganesha's signal handling means that we cannot attach strace without triggering a service restart.
  • Ganesha attempts to cache a lot of open files. Its memory footprint is enormous. Attempting to create trace files, debug logs, or core dumps takes hours, sometimes runs the machine out of memory, induces more performance problems, and sometimes fills local filesystems.

Without core dumps, gdb sessions, strace, and/or detailed debug logs, it is difficult to determine root causes.

Despite this, IBM feels relatively confident that the root cause of our problem stems from an algorithm Ganesha uses to cache directory content. If a user attempts to list the contents of a huge directory, Ganesha will attempt to retrieve all information about that directory in order to cache it. This turns out to be a blocking operation, and all Ganesha threads block until it returns. We have recently discovered many directories within GPFS filesets that contain millions of files. Where ls -1 -U --color=never $DIR can return a listing of tens of millions of files in well under a minute (see the timing below), the same command without those arguments will attempt to stat() every entry in order to sort and colorize by file type.

[root@gennsd1 98433]# time ls -1 -U --color=never | wc -l
15266472

real    0m44.666s
user    0m4.802s
sys     0m7.410s

This behavior would cause that same ls to take hours or days to return. Similarly, the Ganesha thread caching that directory would take hours or days, and all other filesystem operations cease during this blocking period.
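
If you want to see the difference for yourself, a rough sketch (syscall names and counts will vary by distribution) is to compare syscall summaries for the two forms of ls against a large directory:

# Forces colorizing, which stats every entry (slow on huge directories):
strace -c ls --color=always $DIR > /dev/null

# No sorting, no colorizing, no per-entry stat():
strace -c ls -1 -U --color=never $DIR > /dev/null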

IBM tells us that they will provide a "fix" for Ganesha that will alter its behavior to not cache in this fashion. We are awaiting an implementation plan from IBM for this.

Some research on deleting large numbers of files led to this: https://www.slashroot.in/which-is-the-fastest-method-to-delete-files-in-linux

[root@gennsd1 98433]# find . -delete

Corrective/Preventative Action

So what's the plan for a "real fix"?

There are a few "real solutions" for the problem:

  1. Stop using NFS, switching to SMB and GPFS only
  2. Obtain a software fix from IBM for the Ganesha caching problem
  3. Find "large directories" and work with users to trim them down to reasonable sizes

All of these require some time and effort, so are there workarounds we can use to get around NFS while we work on the "real fixes"?

Workarounds

We are going to define some "core use cases" that we think are required for users to do their jobs. We will take actions to ensure that these core functions can be performed in a way that avoids NFS.

We start from the assumption that all users have a computing device that functions without NFS: users need a MacBook[1] and the following capabilities:

  • Docker
  • LSF client
  • Access to /gscuser via NFS[2]
  • Access to /gscmnt data mounts[3]
  • Ability to move data, e.g. via SSH/SCP, between /gscmnt and their Mac[4]
  • Ability to run IGV on data in /gscmnt[5]
  • Ability to print (via Macbook)
  • Ability to consume Common Services (like Office 365, again from Macbooks)

[1] While we believe it's true that all MGI analysts have Macbooks, we are aware that not all laboratory staff have Macbooks. Some people only have workstations. We will note some steps for those users to avoid NFS as well, and work with lab managers to see about replacing old workstations with Macbooks. If you must use a Lucid Workstation, try to avoid using /gscmnt from it.

[2] All MGI users have a HOME directory defined in LDAP that is at /gscuser/$USER that is accessed over NFS via the home-app cluster. This uses a different software stack and is not currently affected. The virtual-workstation2,3,4 and 5 provide /gscuser/$USER over NFSv4.

[3] All data volumes currently in /gscmnt are GPFS volumes. Access to these volumes in the HPC cluster is already always GPFS. Access by users from Macbooks will soon be via SMB. So much of the discussion in this document is a "workaround" until SMB access is provided.

[4] We will document a means of getting data, but again this will be SMB in the near future.

[5] IGV will also improve with the use of SMB.

Virtual Workstations

We have provided 4 VMs named virtual-workstation2,3,4 and 5. Users may SSH to these VMs and have access to their /gscuser NFS mounted HOME directory.

These VMs are LSF clients, so bsub is available; from there you can launch an interactive LSF job to gain access to the /gscmnt data mounts.
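
For example, a minimal sketch (the queue name is a placeholder; use whichever interactive queue your group normally submits to):

# From your Mac, SSH to a virtual workstation:
ssh $USER@virtual-workstation2

# Then launch an interactive LSF job; /gscmnt is visible inside the job:
bsub -Is -q interactive /bin/bash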

There is a document describing how to copy data from /gscmnt to your Mac: How do I copy data from storage0 to my Mac? Note that this is not yet 100% functional; we believe a firewall rule is blocking access. We're working with WU IT Networking to fix that.
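
Until that is fixed and SMB arrives, one workaround sketch (the paths are illustrative, and it assumes your /gscuser home has enough free space) is to stage data into your home directory from inside an interactive job, then pull it to your Mac over SCP:

# Inside an interactive LSF job (see above), stage the file into your home,
# which the virtual workstations also mount:
cp /gscmnt/gc2000/example/file.bam /gscuser/$USER/

# Then, from your Mac, pull it from a virtual workstation:
scp $USER@virtual-workstation2:/gscuser/$USER/file.bam ~/Downloads/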

The VMs have Docker installed. You can work with Docker here if you want, though we expect most users to be doing Docker development from their Macs.

Note that the VMs do not have /gscmnt access. We will add GPFS clients to them to provide /gscmnt access without having to launch an interactive shell, but that work is prioritized behind similar work for the critical servers.

Critical Servers

There are some critical servers that provide functionality MGI users depend upon. Notably, the Cron and Jenkins servers are vital to launching workloads for CLE and production sequencing activities, and they currently rely upon NFS.

cron1

Work is underway to replace cron1 with a new set of Docker and GPFS capable Jenkins slaves to completely deprecate the old cron1 and make CLE and Lab functions more robust. We expect to finish these this week.

cron2

We also recognize that the GMS toolkit is critical for many and we're working with the GMS team to replace many of their Virtual Machines with new versions that are directly under their management. Part of this work will hopefully allow us to deprecate NFS there as well, though some of this depends upon SMB, which is not yet ready.

gscweb

At this time IGV users often access data via web URLs at https://gscweb.gsc.wustl.edu, which relies upon NFS. We are going to build a new GPFS client that will present /gscmnt data under Apache and separate this from gscweb (though we'll include a URL redirect so your URLs don't break).

bpd, the barcode printing daemon

Barcode printers currently print via a process that runs on a VM with NFS. We'll move this to a GPFS-based version.

Real Solutions

All of the items above are "workarounds"; it is worth noting that "real solutions" are also coming in the near future.

Deprecation of Lucid

The LSF host group named "blades" comprises the old Ubuntu Lucid compute nodes (pre-Docker), which, together with the Ubuntu Lucid workstations, are the primary users of NFS aside from the servers mentioned above. The end of Ubuntu Lucid brings the end of NFSv3.

Deprecation of Lucid Workstations

As described in the Workarounds above, we are taking actions to provide a means for users to do work without a Lucid Workstation. The Lucid Workstation will not be "fixed".

We realize there is demand for a Linux workstation, which would fall under a RIS "Personal Computing" service. But that service, while loosely discussed, does not yet exist: there are no real requirements, no development roadmap, and no funding model. So there is nothing to replace the Lucid Workstation just yet.

Until a "Personal Computing" service defines a modern workstation image, users should expect to use a Mac, or continue to use the Lucid Workstation with the aformentioned workarounds to avoid using NFS to access /gscmnt.

Deprecation of Lucid Compute Nodes

We've tried to keep the "blades" hostgroup and its associated queues active, but it is becoming increasingly difficult to support them. We've wanted to avoid setting a "drop dead" date for these queues in order to give people time to switch to Docker, but despite our best efforts, people continue to avoid moving, and we still see regular workloads launched in the "long" queues.

These blades will not survive the Network Cutover to the Wash U networks. We don't have a firm date for that, but we are shooting to beat the end of August.

Migration Of User IDs from MGI IPA to Wash U Wustl Key

This is a prerequisite for using SMB to access data. The why and how are beyond the scope of this document.

Over the next month we will be preparing to re-assign all data in the storage cluster from UID/GIDs in MGI IPA to Wash U AD (also known as WUSTL Key).

Arrival of SMB, the end of NFS

Once authentication is managed by Wash U AD, we'll offer SMB access to the MGI storage cluster.

This will come with some re-arrangement so that all data/filesets live under some "owner's" fileset, like we do in the "research storage" cluster (storage1). Authentication will be required.

Deployment of Software Patches

We'll deploy IBM's software fixes, but with the deprecation of NFS, we won't care about Ganesha NFS any more.


Atlassian Software Update: Tuesday Jan 16 2018
The Atlassian tools JIRA, Service Desk, and Confluence will be offline during a scheduled maintenance for one hour (12:00pm to 1:00pm) on Tuesday January 16, 2018 to perform software upgrades. This is the rescheduling of our cancelled attempt from December 19. Details can be found here:
Let us know if you have any questions.
Changes to LSF Docker Containers

IT recently discovered a problem with LSF jobs using Docker containers. We're about to push out a fix to this problem and you should know about some slight behavior changes.

What was broken?

We recently found that the Docker application for LSF doesn't correctly handle pipes. If you submitted a command like cat $FILE | cut -f 1, only the left side of the pipe (cat $FILE, in this example) would run in the container. Everything else would run outside of the container, directly on the LSF host itself! The resulting process tree would look something like this:

\_ /opt/lsf9/9.1/linux2.6-glibc2.3-x86_64/etc/res ...
    \_ /bin/sh /home/vagrant/.lsbatch/1510946154.1543
        \_ /opt/lsf9/9.1/linux2.6-glibc2.3-x86_64/bin/docker_run.py cat $FILE
            \_ docker run ... bunch of options here ... $IMAGE cat $FILE
        \_ cut -f 1

So cat $FILE would run in the container. The cut command would run outside the container and would just receive the output of the docker_run.py script, which manages pulling, running, and removing the Docker image. And if the right side of the pipe wasn't something simple like cut, it would fail outright because our Docker HPC blades don't have much software installed on them beyond Docker, LSF, and GPFS.

What's changing?

Basically we're just sticking quotes around the submitted command in an LSF configuration file and making sure we properly escape shell commands at a few other levels in LSF. But there will be some changes in behavior that you'll notice!

Once this change goes live, if your command relies on any shell features (pipes, redirects, operators like '||' or '&&') you'll need to do one of two things:

  • Stick your command in a small script and have your job execute that script
  • Wrap your command in /bin/bash -c

For example:

> bsub -q research-hpc -a 'docker(ubuntu)' /gscuser/bdericks/script.sh
> bsub -q research-hpc -a 'docker(ubuntu)' /bin/bash -c "cat $FILE | cut -f 1 > /gscuser/bdericks/output.txt"
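
If you go the script route instead, the script just contains your normal shell pipeline. A minimal sketch of what /gscuser/bdericks/script.sh could look like (the paths are illustrative; remember to chmod +x it):

#!/bin/bash
# Everything in this script runs inside the container, so pipes and
# redirects behave normally.
FILE=/gscuser/bdericks/input.txt   # illustrative input path
cat "$FILE" | cut -f 1 > /gscuser/bdericks/output.txt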

If you don't do one of the above and you attempt to use shell features, they won't be recognized as such. For example, if you submit a command like sleep 15 | sleep 30, you'll see a failure like the one below: the '|' character is not recognized as a shell pipe and is passed as a regular argument to sleep, which doesn't know what to do with it.

sleep: invalid time interval '|'
sleep: invalid time interval 'sleep'
Try 'sleep --help' for more information.

Why the change?

It's important that the full submitted command run inside the container. At best, you get confusing behavior. At worst, there are security implications: MGI IT has restricted direct access to our Docker HPC hosts, and this erroneous behavior allows users to get around that restriction.

We tried several solutions that would preserve the ability to directly submit commands using shell features, but they each fell short in one way or another:

  • If we add a /bin/bash -c to the front of every submitted command, that breaks any images that use an entrypoint.
  • It's possible to detect if an image has an entry point using docker inspect. That does make it possible to preserve the entrypoint and wrap the submitted command in /bin/bash -c, but...
  • We can't assume that the Docker image has /bin/bash. Many minimal containers may not.
  • /bin/sh is often a symlink. And it points to different shells across different OS. That seemed like a mess for us to manage.

Ultimately, the user submitting the job should know if their image has an entrypoint and what shells are available.
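
If you're not sure about your image, you can check both things up front with standard Docker commands (the ubuntu image here is just an example):

# Show the image's entrypoint; empty, [] or <nil> output means none is set:
docker inspect --format '{{.Config.Entrypoint}}' ubuntu

# Check whether the image ships /bin/bash (assumes the image at least has /bin/sh):
docker run --rm --entrypoint /bin/sh ubuntu -c 'test -x /bin/bash && echo "has bash"'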

Questions and Problems

If you notice any problems with your LSF jobs after this change goes out, make an IT Support Desk issue at https://servicedesk.gsc.wustl.edu and we'll help you out as soon as we can!

Update on aggr14 data movement

The volumes gc9020, gc4096, sata431, and sata903 have been moved and are back online.

gc7001 is en route now.

Storage Maintenance Update re: aggr14

The aggr14 storage volumes are still offline.

Recall our last post, Scheduled Outage of Some Storage Volumes; the work is being tracked in the associated Jira issue.

The filesystem check triggered an error in the IBM GPFS code:

Fri Jun 30 13:32:17.106 2017: [X] logAssertFailed: dataPtr.chunkPrevI(nextI) == prevI && chunkCount <= nChunks
Fri Jun 30 13:32:17.107 2017: [X] return code 317, reason code 1729, log record tag 65535
Fri Jun 30 13:32:23.425 2017: [X] *** Assert exp(dataPtr.chunkPrevI(nextI) == prevI && chunkCount <= nChunks) in line 2997 of file /project/sprelttn423/build/rttn423s001a/src/avs/fs/mmfs/ts/classes/basic/suballoc.C

The IBM technical support staff indicate this is a bug that requires a software patch, which they are working on.

Volumes will remain offline until we deploy the patch, re-run the filesystem check, and migrate the data. We will keep you posted as we hear from IBM.

Scheduled Outage of Some Storage Volumes

June 30 outage of aggr14

Maintenance Window June 30 2017

Some time ago, we discovered data corruption within some parts of the aggr14 filesystem. In order to repair the corruption, we must take the filesystem off line to run a filesystem check. We are planning to perform this maintenance beginning June 30.

Users may not recognize the name "aggr14", but you probably know which /gscmnt/* filesystem paths are important to you. See the list of affected volumes below.

  • We will be unmounting only aggr14 and its volumes for this repair.
  • The volumes will remain offline until July 5.
  • The list of affected volumes is included below.
  • While the list looks long, much of the data is old and unused.
  • If you are concerned about this outage, please reach out to us via the Service Desk and ask a question about the storage management service

Details of the outage are being tracked in the associated Jira issue.

Affected volumes

 

/gscmnt/411
/gscmnt/442
/gscmnt/gc1400
/gscmnt/gc1401
/gscmnt/gc1402
/gscmnt/gc1403
/gscmnt/gc1404
/gscmnt/gc2000
/gscmnt/gc2001
/gscmnt/gc2002
/gscmnt/gc2003
/gscmnt/gc2004
/gscmnt/gc2005
/gscmnt/gc2006
/gscmnt/gc2007
/gscmnt/gc2008
/gscmnt/gc2009
/gscmnt/gc2010
/gscmnt/gc2011
/gscmnt/gc2012
/gscmnt/gc2013
/gscmnt/gc2014
/gscmnt/gc2015
/gscmnt/gc2016
/gscmnt/gc2709
/gscmnt/gc2721
/gscmnt/gc2722
/gscmnt/gc3011
/gscmnt/gc3019
/gscmnt/gc3031
/gscmnt/gc4000
/gscmnt/gc4001
/gscmnt/gc4002
/gscmnt/gc4003
/gscmnt/gc4004
/gscmnt/gc4005
/gscmnt/gc4006
/gscmnt/gc4007
/gscmnt/gc4008
/gscmnt/gc4009
/gscmnt/gc4010
/gscmnt/gc4011
/gscmnt/gc4012
/gscmnt/gc4013
/gscmnt/gc4014
/gscmnt/gc4015
/gscmnt/gc4016
/gscmnt/gc4017
/gscmnt/gc4018
/gscmnt/gc4020
/gscmnt/gc4021
/gscmnt/gc4022
/gscmnt/gc4023
/gscmnt/gc4024
/gscmnt/gc4025
/gscmnt/gc4026
/gscmnt/gc4027
/gscmnt/gc4028
/gscmnt/gc4029
/gscmnt/gc4030
/gscmnt/gc4031
/gscmnt/gc4032
/gscmnt/gc4033
/gscmnt/gc4034
/gscmnt/gc4035
/gscmnt/gc4036
/gscmnt/gc4037
/gscmnt/gc4038
/gscmnt/gc4039
/gscmnt/gc4040
/gscmnt/gc4041
/gscmnt/gc4042
/gscmnt/gc4043
/gscmnt/gc4044
/gscmnt/gc4045
/gscmnt/gc4046
/gscmnt/gc4047
/gscmnt/gc4048
/gscmnt/gc4049
/gscmnt/gc4050
/gscmnt/gc4051
/gscmnt/gc4052
/gscmnt/gc4053
/gscmnt/gc4054
/gscmnt/gc4055
/gscmnt/gc4056
/gscmnt/gc4057
/gscmnt/gc4058
/gscmnt/gc4059
/gscmnt/gc4060
/gscmnt/gc4061
/gscmnt/gc4062
/gscmnt/gc4063
/gscmnt/gc4064
/gscmnt/gc4065
/gscmnt/gc4066
/gscmnt/gc4067
/gscmnt/gc4068
/gscmnt/gc4069
/gscmnt/gc4070
/gscmnt/gc4071
/gscmnt/gc4072
/gscmnt/gc4073
/gscmnt/gc4074
/gscmnt/gc4075
/gscmnt/gc4076
/gscmnt/gc4077
/gscmnt/gc4078
/gscmnt/gc4079
/gscmnt/gc4080
/gscmnt/gc4081
/gscmnt/gc4082
/gscmnt/gc4083
/gscmnt/gc4084
/gscmnt/gc4085
/gscmnt/gc4086
/gscmnt/gc4087
/gscmnt/gc4088
/gscmnt/gc4089
/gscmnt/gc4090
/gscmnt/gc4091
/gscmnt/gc4092
/gscmnt/gc4093
/gscmnt/gc4094
/gscmnt/gc4095
/gscmnt/gc4096
/gscmnt/gc5100
/gscmnt/gc5110
/gscmnt/gc6106
/gscmnt/gc6117
/gscmnt/gc6152
/gscmnt/gc7000
/gscmnt/gc7001
/gscmnt/gc7002
/gscmnt/gc7003
/gscmnt/gc7004
/gscmnt/gc7005
/gscmnt/gc7006
/gscmnt/gc7007
/gscmnt/gc7008
/gscmnt/gc9000
/gscmnt/gc9001
/gscmnt/gc9002
/gscmnt/gc9003
/gscmnt/gc9004
/gscmnt/gc9005
/gscmnt/gc9006
/gscmnt/gc9007
/gscmnt/gc9008
/gscmnt/gc9010
/gscmnt/gc9011
/gscmnt/gc9012
/gscmnt/gc9014
/gscmnt/gc9015
/gscmnt/gc9016
/gscmnt/gc9017
/gscmnt/gc9018
/gscmnt/gc9019
/gscmnt/gc9020
/gscmnt/gc9021
/gscmnt/gc9022
/gscmnt/gc9023
/gscmnt/gc9024
/gscmnt/gc9025
/gscmnt/gc9026
/gscmnt/sata100
/gscmnt/sata102
/gscmnt/sata107
/gscmnt/sata198
/gscmnt/sata199
/gscmnt/sata205
/gscmnt/sata400
/gscmnt/sata408
/gscmnt/sata416
/gscmnt/sata417
/gscmnt/sata418
/gscmnt/sata425
/gscmnt/sata426
/gscmnt/sata427
/gscmnt/sata430
/gscmnt/sata431
/gscmnt/sata432
/gscmnt/sata433
/gscmnt/sata434
/gscmnt/sata810
/gscmnt/sata820
/gscmnt/sata821
/gscmnt/sata828
/gscmnt/sata830
/gscmnt/sata834
/gscmnt/sata835
/gscmnt/sata838
/gscmnt/sata840
/gscmnt/sata844
/gscmnt/sata845
/gscmnt/sata846
/gscmnt/sata847
/gscmnt/sata848
/gscmnt/sata849
/gscmnt/sata856
/gscmnt/sata859
/gscmnt/sata860
/gscmnt/sata863
/gscmnt/sata864
/gscmnt/sata865
/gscmnt/sata866
/gscmnt/sata867
/gscmnt/sata870
/gscmnt/sata871
/gscmnt/sata873
/gscmnt/sata874
/gscmnt/sata875
/gscmnt/sata876
/gscmnt/sata877
/gscmnt/sata878
/gscmnt/sata882
/gscmnt/sata883
/gscmnt/sata888
/gscmnt/sata889
/gscmnt/sata890
/gscmnt/sata891
/gscmnt/sata896
/gscmnt/sata897
/gscmnt/sata902
/gscmnt/sata903
/gscmnt/sata904
/gscmnt/sata905
/gscmnt/sata906
/gscmnt/sata908
/gscmnt/sata909
/gscmnt/sata912
/gscmnt/sata913
/gscmnt/sata914
/gscmnt/sata918
/gscmnt/sata919
/gscmnt/sata920
/gscmnt/temp233

 

 

Cleanup of Legacy LIMS Code

In preparation for migrating our database from Oracle to Postgres, the AppDev team is undertaking a widespread cleanup of old code and unused database tables. This cleanup will reduce the total size of our database and make the migration much easier. All aspects of this cleanup have been vetted by members of AppDev and are not expected to affect anyone's current daily work. The cleanup is scheduled to occur early next week, and we will be watching closely for any unexpected fallout. The broad categories for which we are removing code include:

  • mp_grande and primer design
  • Code / tables related to oligo ordering
  • The concept of creating "validation" work orders (i.e. work orders created specifically for primer design, not for sequencing). The old work orders will continue to exist, but will not be displayed specially by the code any longer.
  • The concept of "importing" sequence tags (i.e. gene annotations from ensembl, genbank, etc). We are no longer supporting the tables where this data used to be stored.
  • 3730 sequencing
  • 454 sequencing
  • Support for 'pooling' of capture probes (i.e. merging BED files). Please see (How do I check in or merge capture reagents?) for information about how to use GMS for these needs.

As always, please place a service desk ticket if you have any questions or notice anything unusual as a result of this cleanup.

What's up with storage latency lately?

Summary

We are aware of an uptick in issues related to storage latency over the last several weeks. Remediation actions were delayed as we prepared for, then executed, the May 16 Maintenance Outage. With that work complete, we have turned our attention back to this latency problem. This week we think we made some progress identifying causes, and changes are being made to address them.

Details

The relevant hosts are the "production-hpc" and "research-hpc" LSF host groups. These two host groups have evolved over the last several months to include the blade17 and blade14 hosts. You may guess, correctly, that these are different generations of computer:

  • blade14: 96671 MB (~96 GB) RAM, 24 processors, Intel(R) Xeon(R) CPU X5660 @ 2.80GHz, 2x 1G NIC
  • blade17: 386746 MB (~386 GB) RAM, 48 processors, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 2x 10G NIC

The blade17s have much more RAM, CPU, and network power than the blade14s.

We tune the blades in a number of ways to optimize the kernel for their workloads. Several of the tuning parameters relate to memory management for the GPFS cluster storage software. GPFS requires 16G of memory, and the Linux OS needs memory for basic things like driving the network, running SSH, puppet, and cron, and forking/exec'ing basic shell programs. So we take the 16G for GPFS, add a cushion for the OS, and reserve 25G of RAM in total. We use an LSF "elim" program (elim.mem) to subtract this 25G from the memory offered to LSF jobs, so on a blade14 there's ~71G of RAM available for LSF jobs.
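
As a rough sketch of that arithmetic on a blade14 (numbers rounded; the 9G OS cushion is simply 25G minus the 16G GPFS requirement, and the real reservation is done by the elim.mem program, not by a script like this):

# Approximate blade14 memory budget, in GB:
TOTAL=96          # physical RAM
GPFS=16           # required by GPFS
OS_CUSHION=9      # SSH, puppet, cron, networking, etc.
RESERVED=$((GPFS + OS_CUSHION))                        # the 25G held back
echo "Available to LSF jobs: $((TOTAL - RESERVED))G"   # ~71G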

During recent occurrences of "storage latency", we observe processes unable to allocate memory:

error: fork: Cannot allocate memory

One process that reports this is "ssh". The GPFS cluster software utilizes ssh for delivering commands to its cluster members. When ssh can't fork, it can't execute instructions.

> ssh root@blade15-1-12
ssh_exchange_identification: read: Connection reset by peer

When this happens, the cluster members can't talk to the blade:

May 25th 2017, 09:23:55.000 linuscs116 mmfs Thu May 25 09:23:45.980 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:23:52.000 linuscs117 mmfs Thu May 25 09:23:46.156 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:23:52.000 home-app3 mmfs Thu May 25 09:23:47.648 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:23:47.000 pnsd2 mmfs Thu May 25 09:23:46.505 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:23:46.000 pnsd2 mmfs [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:10:59.000 pnsd1 mmfs Thu May 25 09:10:52.739 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:09:38.000 home-app4 mmfs Thu May 25 09:09:30.206 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:09:30.000 home-app4 mmfs [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:09:30.000 home-app4 mmfs [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:00:35.000 linuscs118 mmfs Thu May 25 09:00:29.430 2017: [E] Connection from 10.100.5.172 timed out
May 25th 2017, 09:00:31.000 linuscs88 mmfs Thu May 25 09:00:25.079 2017: [E] Connection from 10.100.5.172 timed out

At this point, the cluster members must decide what to do about the unresponsive node. Filesystem activity pauses:

Thu May 25 07:52:31.888 2017: [I] Recovering nodes in cluster gpfs-home-app.gsc.wustl.edu: 10.100.5.172
Thu May 25 07:53:05.269 2017: [N] Node 10.100.5.172 (blade15-1-12) lease renewal is overdue. Pinging to check if it is alive
Thu May 25 07:54:58.546 2017: [D] Leave protocol detail info: LA: 165 LFLG: 4883640 LFLG delta: 165
Thu May 25 07:54:58.559 2017: [I] Recovering nodes in cluster gpfs-sol.gsc.wustl.edu: 10.100.5.172
Thu May 25 07:55:05.295 2017: [E] Node 10.100.5.172 (blade15-1-12) is being expelled because of an expired lease. Pings sent: 60. Replies received: 60.
Thu May 25 07:55:06.074 2017: [I] Recovering nodes in cluster gpfs.gsc.wustl.edu: 10.100.5.172
Thu May 25 07:55:07.837 2017: [I] Log recovery for log group 212 in aggr14 completed in 0.130482000s
Thu May 25 07:55:08.205 2017: [I] Recovered 1 nodes for file system aggr14.
Thu May 25 07:55:09.902 2017: [D] Leave protocol detail info: LA: 165 LFLG: 4883651 LFLG delta: 165
Thu May 25 07:55:09.931 2017: [I] Recovering nodes in cluster gpfs-sol2.gsc.wustl.edu: 10.100.5.172
Thu May 25 09:10:52.739 2017: [E] Connection from 10.100.5.172 timed out

At this point the reader might wonder, "What good is a clustered filesystem if everything stops when one node goes bad?" Please pause to remember your Computer Science: the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem).

In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: Consistency, Availability, Partition tolerance


In other words, the CAP Theorem states that in the presence of a network partition, one has to choose between consistency and availability.

For a cluster filesystem, would you rather choose "Available" or "Consistent"? If you choose availability, you must accept the possibility of inconsistent, i.e. corrupt, data. Here, we choose Consistency, and thus give up Availability.

In short, we'd rather have your filesystem be slow than corrupt your data.

Why are we running out of memory?

But why are we running out of memory? We're reserving some for the OS. We impose limits in LSF. What are we missing?

Yesterday we re-discovered one of our tuning parameters.

root@blade17-1-1:~# sysctl vm.min_free_kbytes
vm.min_free_kbytes = 11631000
 
(~/git/puppet-modules)-(master)
(ins)-> grep -A1 vm.min_free_kbytes hiera/roles/ostack_kilo_hpc.yaml
  'vm.min_free_kbytes':
    'value': '11631000'

What's this parameter for?

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-tunables.html

min_free_kbytes

The minimum number of kilobytes to keep free across the system. This value is used to compute a watermark value for each low memory zone, which are then assigned a number of reserved free pages proportional to their size.

Be cautious when setting this parameter, as both too-low and too-high values can be damaging and break your system. Setting min_free_kbytes too low prevents the system from reclaiming memory. This can result in system hangs and OOM-killing multiple processes.

However, setting this parameter to a value that is too high (5-10% of total system memory) will cause your system to become out-of-memory immediately. Linux is designed to use all available RAM to cache file system data. Setting a high min_free_kbytes value results in the system spending too much time reclaiming memory.

This parameter must be tuned to find a Goldilocks value that is not too small and not too large. Based on our history (https://jira.gsc.wustl.edu/browse/INFOSYS-15484) we've set this value to 3% of total memory. But we learned two things this week:

  1. We set this number to a fixed value across all the HPC nodes, missing the fact that the blade14s have much less RAM than the blade17s. On a blade17 the fixed value is about 3% of RAM, but the same number on a blade14 is roughly 12% of its RAM (see the sketch after this list).
  2. We failed to account for this amount of RAM in the number we reserve for LSF. This allows LSF jobs to consume memory we should be reserving for the kernel!
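
A sketch of the per-host calculation that fixing issue #1 implies, i.e. 3% of whatever RAM the node actually has rather than one hard-coded number (the real change lives in our puppet hiera data, not in a script like this):

# Compute 3% of this host's RAM in kilobytes and compare to the live setting:
TOTAL_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
TARGET_KB=$((TOTAL_KB * 3 / 100))
echo "min_free_kbytes target: $TARGET_KB"
sysctl vm.min_free_kbytes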

Issue #1 is fixed here https://jira.gsc.wustl.edu/browse/ITDEV-3309 and was deployed last night.

Issue #2 is being tracked here https://jira.gsc.wustl.edu/browse/ITDEV-3311 and will be deployed as soon as possible.

In addition to these parameters being fixed, we're also going to update our server tests to auto-close blades when we detect problems like these (and others, like improper permissions on the docker socket). That is being tracked here: https://jira.gsc.wustl.edu/browse/ITDEV-3317
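
The auto-close logic will live in our existing server tests, but the basic shape (a sketch only; the real checks, thresholds, and the expected docker.sock ownership are assumptions here and are tracked in the tickets above) is something like:

# Close this blade in LSF if it looks unhealthy.
HOST=$(hostname)

# Can we still fork? A failed fork is the symptom described above.
if ! /bin/true; then
    badmin hclose -C "fork failure detected" "$HOST"
fi

# Does the docker socket have the expected ownership? (root:docker is an assumption)
if [ "$(stat -c '%U:%G' /var/run/docker.sock)" != "root:docker" ]; then
    badmin hclose -C "bad docker.sock ownership" "$HOST"
fi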

We are hopeful that these improvements will return stability to the cluster, and you can get back to your work!

 

May 16 Maintenance Complete, Followup Begins

The May 16 maintenance outage was finished last week, with some drama related to Samba services (smb-cluster) and the aggr14 filesystem. All services were restored by late last week, and we've made a collection of "follow up" tasks that are being tracked in Jira.

It took about 4 hours just to get things cleanly shut down for maintenance. The primary goal, enabling DMAPI on the GPFS filesystems, was accomplished relatively soon after that. The secondary goals regarding the home-app cluster were completed next, making future maintenance on home-app servers much easier. The unexpected work came after that, when Samba services did not properly return and repairing the aggr14 filesystem corruption took some time. Regarding aggr14, it turns out that some of the reported filesystem corruption was a false positive caused by a bug that has since been fixed in the next version of GPFS. The number of files actually corrupted was small, and the data was recovered after the filesystem came back online. So, in the end, no data was lost at all. Ironically, the aggr14 data in question is scheduled for deletion.

As usual, we thank you for your patience during maintenance outages. We know that the interruption can be frustrating.

IT Systems Outage May 16 2017

There will be an IT Systems Outage on Tuesday May 16 2017 beginning at 5:15pm

Details of this outage, its goals and impact, can be found here: Maintenance Window May 16 2017

In brief:

  • All running LSF jobs will be terminated
  • Any pending LSF jobs will be left pending
  • User sessions will be terminated; save your work and log out, but leave your workstation running

 

@genome Mail Aliases End Of Life

On May 15, 2017 any remaining @genome mail aliases will be retired.
If you have not yet moved your alias to an O365 Group, please do so prior to May 15.
If you no longer need a listed alias, please let us know via the service desk so it can be removed: https://jira.gsc.wustl.edu/servicedesk/customer/portal/1/create/25?
If you would like assistance with creating a group in O365, please make a request via the service desk: https://jira.gsc.wustl.edu/servicedesk/customer/portal/1/create/24?
Remaining aliases and owners are listed in the file below.

RT ticketing systems EOL

We are approximately one month away from the final transition of all mail services to O365. RT depended heavily on our internal mail routing services, which will no longer be available, and we will not be upgrading RT to work with O365. Therefore, RT will stop functioning at that time, and we suggest you plan accordingly to move your ticket/issue tracking needs to JIRA.

The remaining active RT queues are:
apipe-support
assets  
facilities  
medical-genomics
ng-library-construction
pipette-repair
project-information
project-management
repair
resource-bank
service-contracts
Shipping-Request-ResourceBank

You can make the request for a JIRA project in the IT service desk: https://jira.gsc.wustl.edu/servicedesk/customer/portal/1/create/51

Docker Migration Poll

I've been calling this "The Docker Road Show", talking to anyone who will listen about IT efforts to decommission our old Ubuntu Lucid compute image in favor of a more modern infrastructure based on Docker containers. You may know that the software the IT team uses to run "production" pipelines has already been moved to Docker. You may also know that other teams around MGI have worked hard to move their tools into Docker as well. But the pace of migration is largely unknown to us, as we attempt to communicate across many different teams with differing levels of interest.

The goals of this IT Blog post are manifold:

  • Further communicate our mission to decommission the legacy compute image.
  • Get some feedback from the user community. How are things going?
  • Accelerate the migration. If people won't miss the Lucid LSF slots, we'll migrate more hosts to the Docker enabled image.

I put up this new page in the ITKB space to show how many of which sort of host exist today:

Conversion of all High Performance Computing (HPC) to Docker

I created this Doodle Poll in an attempt to learn how many of you have begun using Docker:

Where are you on your Docker journey?

Please take a moment to participate in the poll and let us know how things are going!

Don't forget the ITKB has a number of links about Docker, including: How do I use Docker? (Where do I start?) 

Docker-Interactive LSF Queue

MGI has had an interactive LSF queue for years that's useful for messing around with tools in the same environment that HPC jobs run in. With Docker seeing more and more use around MGI, we thought it would be useful to add a docker-interactive queue. In addition, we've increased the run limit on both queues to 1 week (from 24 hours).

We've also pushed out a new docker-interactive command to all hosts that use LSF. If you run this command without any arguments, you'll get an interactive session in the lucid-default container, which is based on Ubuntu Lucid with basic tools like wget and vim installed (you can see the full Dockerfile here). You can also pass the command a Docker image and it'll use that instead.
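
For example (the image tag is only an example; any image you can pull should work):

# Drop into the default lucid-default container:
docker-interactive

# Or request a session in a specific image:
docker-interactive ubuntu:16.04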

All of the hosts behind the interactive and docker-interactive queues use the same NFS server as workstations (ces-workstation), which should help keep things stable and responsive.

If you have any problems using the docker-interactive queue or command, let us know in a Support Desk issue and we'll get to it as soon as we can. Have a good weekend!

Moving Research HPC from NFS to GPFS

Hey all! As some of you Docker users may know, things haven't been running so smoothly in the research-hpc queue since we started using NFS there. It turns out NFS can be slow, and if it gets slow enough, Docker containers time out and won't start. To fix that, we've decided to buy more GPFS licenses and install the GPFS client on the research-hpc hosts, which should improve speed and stability!

This is a high priority fix, so we've already closed the hosts and will be killing jobs in the research-hpc queue to get this upgrade out quickly and get the queue flowing again. I'll be sending an email to allmgi with a link to this post and a follow-up email once the upgrade is done.

We're confident this change will work because it will make research-hpc identical to production-hpc, which has been humming along at high throughput with no stability issues for a couple weeks after a bunch of tweaking and improvements.

I'll make another blog post with some details about what went wrong and why we chose to fix it this way, but that'll have to wait until things have stabilized a bit more. In the meantime, you can follow along in the Jira issue tracking this work.

Thanks for your patience!