MusicBrainz Search Server / SEARCH-410

Search server resource conflict - sockets left in CLOSE_WAIT state


    • Type: Bug
    • Resolution: Unresolved
    • Priority: High

      I ordered 48 GB of RAM for each of the servers, and that did not fix the problem of the servers tipping over every few hours. Before the RAM upgrade, the problem had become so bad that the servers were restarting several times an hour, giving overall terrible results.

      After the RAM upgrade, nothing changed. Literally nothing, so this is NOT a gradual memory leak, which explains why all of the attempts to look for memory problems didn't pan out. I then watched the search servers for a failure. Once one failed, I took it out of rotation and poked at it. lsof showed that the search server had 800 sockets in the CLOSE_WAIT state. Any connections to the search server timed out; clearly the server could not open new sockets. As soon as I restarted the server, all of the CLOSE_WAIT sockets went away.
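
      For context, a socket sits in CLOSE_WAIT when the remote side has closed its end (sent a FIN) but the local application has never called close() on its own end; the kernel will not clean it up on its own. A minimal Java sketch, separate from the search server code, that reproduces the state:

{code:java}
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class CloseWaitDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Bind a listener on any free port and connect to it.
        ServerSocket listener = new ServerSocket(0);
        Socket client = new Socket("localhost", listener.getLocalPort());
        Socket accepted = listener.accept();

        // The client closes its end, which sends a FIN to the server side.
        client.close();

        // The accepted socket is never closed, so the kernel leaves it in
        // CLOSE_WAIT for as long as this process lives. The fix is simply
        // to call accepted.close() on every code path.
        System.out.println("Leaving socket on port " + accepted.getLocalPort() + " unclosed");
        Thread.sleep(60_000); // keep the process alive so the state can be inspected
    }
}
{code}

      While the demo process sleeps, lsof or netstat shows its accepted socket in CLOSE_WAIT; the failed search server showed the same state, just 800 times over.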

      Researching the CLOSE_WAIT socket issue, I found this:

      http://unix.derkeiler.com/Mailing-Lists/SunManagers/2006-01/msg00367.html

      Effectively, the problem lies with the program, not the OS. That is the conclusion over and over again. So, to rule out Jetty as the culprit, I moved back to Tomcat, installing the latest version, 8.0.20. I deployed this on both search servers and monitored the count of CLOSE_WAIT sockets. At first no sockets were leaking and the number of open files stayed roughly constant. Then, early this morning, both servers tipped over again. I performed the same examination and found 4,125 sockets in the CLOSE_WAIT state.
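
      For ongoing monitoring, it may be worth having the search server report this number itself. A minimal sketch, assuming Linux, where CLOSE_WAIT appears as state 08 in /proc/net/tcp and /proc/net/tcp6 (the class name is just illustrative):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class CloseWaitCounter {

    // CLOSE_WAIT is state 08 in the kernel's TCP state table.
    private static final String CLOSE_WAIT = "08";

    public static long countCloseWait() throws IOException {
        long total = 0;
        for (String table : new String[] {"/proc/net/tcp", "/proc/net/tcp6"}) {
            Path path = Paths.get(table);
            if (!Files.exists(path)) {
                continue;
            }
            List<String> lines = Files.readAllLines(path);
            // Skip the header; the columns are: sl, local_address, rem_address, st, ...
            for (int i = 1; i < lines.size(); i++) {
                String[] fields = lines.get(i).trim().split("\\s+");
                if (fields.length > 3 && CLOSE_WAIT.equals(fields[3])) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("CLOSE_WAIT sockets on this host: " + countCloseWait());
    }
}
{code}

      Note that this counts CLOSE_WAIT sockets host-wide rather than per process; something like lsof -p <pid> narrows the view to a single process.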

      This rules out the container as the cause; the search server itself is causing the problem. Now I've confirmed this for sure, and we have more clues with which to debug. To summarize the problem:

      1. Everything is fine. Memory usage gradually increases (I assume this is Lucene's doing). During this phase there are NO sockets in CLOSE_WAIT.
      2. At some point, after running for a while, something happens and it happens FAST: in a very short period of time, thousands of sockets get stuck in CLOSE_WAIT.
      3. The server stops accepting connections, and connections to it time out. On Jetty, CPU utilization drops to nil; on Tomcat it goes to 80%.

      Given what I've read, we should double-check that the search server properly closes its connections, especially in error situations. I suspect that some error happens and the handling of that error then causes all further connections to get stuck in CLOSE_WAIT.
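
      I haven't identified the offending code path yet, so the sketch below is purely hypothetical (the class and the runSearch() stand-in are invented, not our actual handler). It shows the shape of bug to look for: an error path that logs and returns without closing the connection's stream, next to a try-with-resources version that closes it on every path.

{code:java}
import java.io.IOException;
import java.io.PrintWriter;
import java.net.Socket;

public class ConnectionHandling {

    // Leaky variant: on an error, the writer and socket are never closed.
    // Once the client gives up and closes its end, this socket is stuck
    // in CLOSE_WAIT until the whole process is restarted.
    static void handleLeaky(Socket connection, String query) throws IOException {
        PrintWriter writer = new PrintWriter(connection.getOutputStream());
        try {
            writer.println(runSearch(query));
            writer.flush();
        } catch (RuntimeException e) {
            System.err.println("search failed: " + e.getMessage());
            // ...and we return without closing writer or connection.
        }
    }

    // Safe variant: try-with-resources closes the writer and the socket on
    // every path, including the error path.
    static void handleSafe(Socket connection, String query) throws IOException {
        try (Socket conn = connection;
             PrintWriter writer = new PrintWriter(conn.getOutputStream())) {
            writer.println(runSearch(query));
            writer.flush();
        } catch (RuntimeException e) {
            System.err.println("search failed: " + e.getMessage());
        }
    }

    // Stand-in for the real Lucene query; purely hypothetical.
    private static String runSearch(String query) {
        return "results for " + query;
    }
}
{code}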

            Assignee: Unassigned
            Reporter: Robert Kaye (rob)
            Votes: 1
            Watchers: 2
