We get this question often. It isn't always immediately obvious what's going on with a web site or service that is "down," and additional information about what our probes are seeing can be helpful. For example, if your website is timing out, is it the web server, a DNS problem, or maybe packet loss on the network? NodePing's Automated Diagnostics and on-demand Diagnostic Tools allow you to run several utilities to see what our probes (or your AGENTs) are seeing about your web site or service. These tools can be useful to narrow down where the failure is occurring so you can get things fixed and service restored as quickly as possible.
Use the ping tool to help detect packet loss or routing problems between our probe servers and your services. We'll send 10 ICMP packets to the target IP or FQDN from the probe you choose. Look for the packet loss percentage in the summary line at the bottom of the results.
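If you save ping results and want to pull the loss figure out programmatically, a minimal sketch might look like the following. It assumes the summary line follows the common Linux ping format ("... 10% packet loss ..."); the function name and the sample text are illustrative, not output from a NodePing probe.

```python
import re
from typing import Optional

# Matches the loss figure in a summary line such as:
# "10 packets transmitted, 9 received, 10% packet loss, time 9012ms"
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def packet_loss_percent(ping_output: str) -> Optional[float]:
    """Return the packet loss percentage found in ping output, or None if absent."""
    match = _LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


# Illustrative sample only (documentation IP range).
sample = """\
--- 192.0.2.10 ping statistics ---
10 packets transmitted, 9 received, 10% packet loss, time 9012ms
"""
print(packet_loss_percent(sample))  # 10.0
```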
The traceroute tool is helpful in determining at what point a route is failing. It can also show what firewall is blocking our probes. The results show the route taken from our probe to your target IP or FQDN. Each line is a "hop" and shows the latency between our probe and that node along the route. Look to see if the last line is your destination. If it isn't, there's either a routing issue, a firewall blocking the route, or the server is down.
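The "is the last line my destination?" test can be sketched in a few lines. This assumes the classic traceroute text layout, where each hop line starts with a hop number; the helper name and sample output are hypothetical.

```python
import re


def reached_destination(traceroute_output: str, destination_ip: str) -> bool:
    """Return True if the last hop line of the output names destination_ip."""
    hop_lines = [
        line for line in traceroute_output.splitlines()
        if re.match(r"\s*\d+\s", line)  # hop lines start with a hop number
    ]
    if not hop_lines:
        return False
    # Split on whitespace and parentheses, then compare whole tokens so that
    # 192.0.2.1 does not accidentally match 192.0.2.10.
    tokens = re.split(r"[\s()]+", hop_lines[-1])
    return destination_ip in tokens


# Illustrative sample only: the route stops answering at hop 3.
blocked = """\
traceroute to 192.0.2.10 (192.0.2.10), 30 hops max, 60 byte packets
 1  198.51.100.1 (198.51.100.1)  0.412 ms  0.398 ms  0.377 ms
 2  203.0.113.7 (203.0.113.7)  1.921 ms  1.874 ms  1.902 ms
 3  * * *
"""
print(reached_destination(blocked, "192.0.2.10"))  # False
```

A False result only says the destination never answered; it does not tell you whether routing, a firewall, or the server itself is responsible.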
Use the mtr tool to help detect packet loss and routing problems. It's like ping and traceroute combined. Running an mtr is a good place to start when you see a 'Timeout' failure. Look at the "Loss%" column to see packet loss. If the last line isn't your destination, it likely indicates either a routing issue, firewall, or the server is down.
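For longer reports, it can help to list only the hops that show loss. The sketch below assumes the plain-text layout produced by mtr's report mode (hop lines such as "  2.|-- host  20.0%  10 ..."); the function name, threshold parameter, and sample report are illustrative assumptions.

```python
import re
from typing import List, Tuple

# Hop line in an mtr text report: "  2.|-- 203.0.113.7   20.0%   10   1.9 ..."
_HOP_RE = re.compile(r"^\s*(\d+)\.\|--\s+(\S+)\s+(\d+(?:\.\d+)?)%?\s")


def hops_with_loss(mtr_report: str, threshold: float = 0.0) -> List[Tuple[int, str, float]]:
    """Return (hop number, host, loss %) for every hop whose loss exceeds threshold."""
    lossy = []
    for line in mtr_report.splitlines():
        match = _HOP_RE.match(line)
        if match and float(match.group(3)) > threshold:
            lossy.append((int(match.group(1)), match.group(2), float(match.group(3))))
    return lossy


# Illustrative sample only (documentation IP ranges).
report = """\
HOST: probe                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 198.51.100.1               0.0%    10    0.4   0.5   0.3   0.9   0.2
  2.|-- 203.0.113.7               20.0%    10    1.9   2.0   1.8   2.4   0.2
  3.|-- 192.0.2.10                 0.0%    10    5.1   5.3   5.0   6.0   0.3
"""
print(hops_with_loss(report))  # [(2, '203.0.113.7', 20.0)]
```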
The dig tool is for finding DNS issues. Look for errors like "no servers could be reached", which indicates that the DNS server is unavailable. A missing "ANSWER SECTION" means there is no resolution for that FQDN.
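The two signals described above can be checked mechanically. This is a minimal sketch that only looks for those two strings in the dig output text; the function name and return labels are made up for illustration.

```python
def classify_dig_output(dig_output: str) -> str:
    """Classify dig output using the two signals described above."""
    if "no servers could be reached" in dig_output:
        return "dns-server-unavailable"
    if "ANSWER SECTION" not in dig_output:
        return "no-resolution"
    return "resolved"


# Illustrative fragments only, not complete dig output.
unreachable = ";; connection timed out; no servers could be reached\n"
resolved = ";; ANSWER SECTION:\nexample.com.\t300\tIN\tA\t192.0.2.10\n"
print(classify_dig_output(unreachable))  # dns-server-unavailable
print(classify_dig_output(resolved))     # resolved
```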
Use the screenshot tool for your HTTP checks to get a visual snapshot of your site. This diagnostic tool isn't available from AGENT locations.
You can enable automated diagnostics for a check. When the check fails, our system will inspect the failure and run an appropriate set of diagnostics.
For timeout and connectivity errors, we'll run a set of MTRs from the probes that verified the check was failing.
For DNS errors, we'll run dig queries to your nameservers from the probes.
Diagnostic results can be viewed in the 'Check Status Report' as well as via our API.
You can optionally have the results of automated diagnostics emailed to any contact methods that are set to receive immediate notifications on the checks. The setting can be found in Account Settings - Notifications Settings.
You can access the on-demand Diagnostic Tools in our web UI by clicking on the 'Diagnostic Tools' tab. Diagnostic tools are also available in our API.
On-demand Diagnostics are limited to 10 requests in 5 minutes. You can contact support to request a diagnostic rate increase.
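If you script diagnostic requests, a client-side throttle can keep you under this limit. The sketch below assumes a sliding window of 10 requests per 300 seconds; how the limit is actually counted on the server side is not stated here, so treat the window model as an assumption. The class name and the injectable clock are illustrative.

```python
import time
from collections import deque
from typing import Callable, Deque


class DiagnosticThrottle:
    """Allow at most max_requests calls in any window_seconds span (sliding window)."""

    def __init__(
        self,
        max_requests: int = 10,
        window_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sent: Deque[float] = deque()  # timestamps of granted requests

    def try_acquire(self) -> bool:
        """Return True and record the request if it fits in the window, else False."""
        now = self._clock()
        # Forget requests that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            self._sent.append(now)
            return True
        return False
```

Call try_acquire() before each diagnostic request and wait (or skip the request) when it returns False.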
Here are some tips for troubleshooting some of the most common failures seen for the various check types.
HTTP checks can be the most complex of our checks to troubleshoot because they rely on so many other services to function.
HTTP checks usually require DNS, IP routing, proper firewall configuration, a working SSL certificate, a functioning database, and a running webserver. It's no wonder we receive more questions about failed HTTP checks than any other check type.
One of the fastest ways to diagnose an HTTP failure is to create checks for all the dependent services (DNS, ping, port, and SSL). If your HTTP check fails but none of the other checks fail, it's a good indicator that the webserver is the cause of the trouble. If your ping check fails, you can figure that the server is offline, routing is broken, or your datacenter is experiencing packet loss.
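That reasoning can be written down as a small decision helper. Only the "HTTP alone is failing" and "ping is failing" rules come from the text above; the handling of failing DNS, port, and SSL checks and the order of the tests are assumptions added for illustration, as are the function name and dictionary keys.

```python
from typing import Dict


def likely_cause(check_up: Dict[str, bool]) -> str:
    """Suggest a likely cause from check states (True means the check is passing).

    Expected keys: 'http', 'dns', 'ping', 'port', 'ssl'. Missing keys count as passing.
    """
    if check_up.get("http", True):
        return "HTTP check is passing"
    if not check_up.get("ping", True):
        return "server offline, broken routing, or packet loss"
    # Assumption: any other failing dependency is reported by name.
    failing = [name for name in ("dns", "port", "ssl") if not check_up.get(name, True)]
    if failing:
        return "dependent service failing: " + ", ".join(failing)
    return "webserver is the likely cause"


print(likely_cause({"http": False, "dns": True, "ping": True, "port": True, "ssl": True}))
# webserver is the likely cause
```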
Timeout failures can be the hardest to diagnose. We recommend you start by running an MTR to see if there are any routing problems or packet loss between our probes and your service. If the MTR doesn't show any problems, run a dig test against the FQDN in your URL to see if a slow or failed DNS service is causing the timeout. Please note that page load and screenshot tests are not useful for HTTP checks that are timing out.
500 errors usually indicate a problem on the server itself. It could be the database is offline or your box has run out of memory. For 500 errors, you'll want to contact your hosting company's support.
Our ICMP PING checks are fairly simple and there are only a few things that can go wrong. Either the server is offline (turned off), a firewall is blocking the pings, the network is unroutable, or there's packet loss somewhere along the route.
Timeout failures can be verified by running the ping diagnostic tool from the same location your check runs from. This will verify that the pings are, in fact, not being replied to. To check for routing and packet loss problems, run an mtr diagnostic. It will let you know where along the route our pings stop receiving a reply. If the mtr shows that it's sometimes able to reach your server, you'll be able to see a rough approximation of the packet loss between our probe and your server.
DNS can be difficult to troubleshoot since changes sometimes take many hours to propagate. Use our dig diagnostic tool to query a specific FQDN against a specific DNS server (or the default DNS server on our probe) to see if resolution is correct.
The most common issue we see with SSH checks is timeouts that the owner is unable to reproduce. These are usually caused by unresponsive DNS servers for the PTR records of the SSH host. OpenSSH clients, by default, do a reverse IP lookup when connecting to a server. If the DNS servers are not responding, the SSH check may time out before ever trying to connect. The challenge comes when someone who often connects to that SSH server is able to connect, but our probes are not. This is because their client has cached the PTR record, so the OpenSSH client uses the cache and then connects. We recommend creating DNS checks for the PTR records of the IP addresses of the servers you have SSH checks for. You can also use our dig diagnostic tool to see if the PTR records are available.
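A PTR record lives under a reversed form of the IP address, so it helps to know the exact name to query. As a small sketch, Python's standard ipaddress module can build that name; the helper name and the sample address are illustrative.

```python
import ipaddress


def ptr_record_name(ip: str) -> str:
    """Return the reverse-DNS name whose PTR record maps this IP back to a hostname."""
    return ipaddress.ip_address(ip).reverse_pointer


print(ptr_record_name("192.0.2.10"))  # 10.2.0.192.in-addr.arpa
```

The resulting name is what you would query with record type PTR.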
Our various email checks are similar to HTTP checks in that they rely on several other services to function properly. DNS, routing, and packet loss can all cause mail checks to fail.
Timeout failures should first be verified by running an MTR to see if there are any routing problems or packet loss between our probes and your service. If the MTR doesn't show any problems, run a dig test against the check's FQDN to see if a slow or failed DNS service is causing the timeout.
If you're still having trouble determining the cause of an outage, NodePing support is always available and happy to help. Send an email to firstname.lastname@example.org with the check label or id and we'll do what we can to help you troubleshoot.