Sunday, December 25, 2011

Troubleshooting faulty network connectivity, part 1: A step-by-step guide


“My computer won’t connect to the Internet”.
“I can’t get online”.
“I’m getting ‘page cannot be displayed’ errors”.
If you have worked as a network administrator or help desk specialist, you have probably heard these complaints and variants of them more times than you can count. Between improper IP addressing, malfunctioning hardware or faulty cables, incorrect DNS resolution, incorrect access permissions and a slew of other factors, there are many reasons why modern computer users could experience difficulty getting their client machines online or connecting to the resources that they need (e.g., the Internet in general, a certain website, a network share, a printer, etc.).
Given the many interconnected technologies that work together to provide modern computer networking, it is no surprise that the inability to get online is one of the most common complaints of computer users. An inspection of any Web discussion forum focusing on technical support will reveal that networking problems are among the most common challenges that users face (see, for example, Computer Hope and Tech Support Guy).
Difficulties can include the inability to get an IP address from a DHCP server, incorrect DNS server IP addresses on the client, malfunctioning DNS servers, a faulty network interface card (NIC) or NIC driver, malware attacks, strict firewall rules, and misconfigured Web browsers.
Since so many networking component technologies could be the culprits behind a lack of connectivity, a unified methodology towards troubleshooting network errors and outages is ideal for technicians. Why? To help ensure that you do not overlook any steps, or do any unnecessary or repetitive work as you attempt to establish the required connectivity.

The CompTIA Network+ troubleshooting model provides a broad, high-level plan of action for remedying faulty connectivity.
  1. Identify the symptoms and potential causes. In this first step you define and determine the nature of the problem. Is it the user or computer that is problematic? Are all websites unreachable, or just one or a few? Is the computer consistently online or is the connection flapping? Are websites reachable by IP address but not by name? Are there any error messages indicating what type of error was encountered? Based on your answers to these questions you can begin to make educated guesses as to the cause. Gather detailed information.
  2. Identify the affected area. This step is similar to the first step, but here you determine the extent of the problem. Is it affecting one computer or user, or multiple computers or users? Are all computers in the subnet (or all users in the domain) affected? Is the whole network down? If you are providing support to another user, can you reproduce the error yourself? Gather detailed information.
  3. Establish what has changed. This is where you try to put the connectivity problem in some kind of time frame. Find out if the user was ever able to successfully do what they now cannot do. When did the error first appear? Before the appearance of the error were there any programs or operating system updates installed? How about new drivers or browser plugins? Were any new nodes (clients, servers, networking devices, printers, etc.) added to the network? Any new users, user groups, or Active Directory objects such as domains, OUs, or sites?
  4. Establish the most probable cause. Use your technical expertise to isolate and explain the cause of the problem. Some in-depth investigation and diagnostic tools will probably be required. This step is described in more detail below.
  5. Determine if escalation is necessary. If you believe that the connectivity error is outside your scope of administration, you will need to transfer responsibility for its resolution to another entity. For example, if you cannot connect to the Internet and you strongly believe that the problem is not your computer, router, or other equipment, you will need to contact your Internet service provider (ISP) and ask them to investigate…perhaps one of their lines or routers is down. Another scenario might involve a piece of equipment on your network that is contractually administered by a third party.
  6. Implement an action plan and solution including potential effects. Whether the responsibility to fix the error falls on you or has to be transferred to another party, you must devise a resolution and start checking off action items.
  7. Test the result. When you believe the steps of the action plan have been fulfilled, try to re-create the error. Observe the results of the implementation. Is the problem gone? If not, repeat steps 4, 5, and 6.
  8. Identify the results and effects of the solution. Once a solution has been found, ensure that normal network operation has been restored and that no new problems have been introduced.
  9. Document the solution and process. Describe the error conditions and the steps taken for a solution. This will aid you in troubleshooting the same or similar problems in the future. Make sure you include a sufficient amount of detail, including operating system versions, application versions, driver versions, software vendor update numbers, etc.
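
As one way to make step 9 routine, the details gathered in the earlier steps could be captured in a small structured record. The following is only a sketch: the class name IncidentRecord and all of its field names are hypothetical and are not part of the CompTIA model. It records the operating system name and release automatically so that this detail is not forgotten.

```python
import json
import platform
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    """Hypothetical record layout; the field names are illustrative only."""
    symptoms: str            # step 1: what the user sees
    affected_area: str       # step 2: one host, a subnet, the whole network
    recent_changes: str      # step 3: updates, drivers, new nodes
    probable_cause: str      # step 4: what the diagnosis pointed to
    resolution: str          # steps 6 to 8: what was done and the outcome
    escalated_to: str = ""   # step 5: left empty if handled locally
    # Filled in automatically from the machine the record is created on.
    os_name: str = field(default_factory=platform.system)
    os_release: str = field(default_factory=platform.release)
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize the record so it can be stored or attached to a ticket."""
        return json.dumps(asdict(self), indent=2)
```

Creating a record with the five required fields and calling to_json() yields text that can be pasted into a ticket or saved to a file. Application, driver, and vendor update versions would still need to be added by hand or by further fields.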

Step 4 in detail

Step 4, ‘Establish the most probable cause’, is the one that will require the most time and effort on your part. To troubleshoot TCP/IP connectivity on a technical level, this step involves its own set of steps. When a network host cannot connect to a desired resource, the following procedures will help you narrow down the cause of the problem.
  1. Ping the local loopback address, 127.0.0.1. This address belongs to a software loopback interface rather than to your physical network adapter. If you get replies, you know that the TCP/IP protocol suite is installed and functioning correctly. If this ping fails, the TCP/IP stack or its drivers are probably corrupted. Because loopback traffic never reaches the NIC, this test does not yet tell you anything about the adapter or the cable.
  2. Ping your own IP address. Use the command ipconfig (or ifconfig or ip addr for Linux/Unix) to determine your IP address. If you don’t have one, this is a very useful clue as to the nature of your problem. If you do have one and your computer’s routing table is correct, this ping is handled locally, in the same way as a ping to the loopback address of 127.0.0.1. Note: If your IP address is returned as 169.254.x.x, you have been assigned an APIPA IP address. This means that the local DHCP server is not configured properly or cannot be reached from your computer, and the computer has assigned itself an address automatically. APIPA (Automatic Private IP Addressing) is a feature of Microsoft Windows that acts as a fallback when DHCP is unavailable. When a DHCP request or server fails, APIPA picks an address from the reserved link-local block 169.254.0.0/16 (Microsoft documents the range as 169.254.0.1 to 169.254.255.254). A host with such an address can talk only to other hosts on the same network segment that also use link-local addresses, so it will not reach the Internet. This functionality is also known as link-local address assignment and is one part of zero configuration networking.
  3. Ping the IP address of a known good host in your subnet. This will determine if your IP address and subnet mask are valid, and if the switch or router you are connected to is functioning properly; perhaps you are plugged into a faulty port or your network cable is faulty. Also, can other hosts in the subnet ping your IP address? If pings in this step fail, ensure that the network cable is plugged into the network adapter and ensure that you get a link light. The link light indicates whether a network connection exists between the NIC and the network. If the link light is off, the cable or switch port is probably at fault, or perhaps the NIC itself is toast. If you can ping both the loopback address and your own IP address but not hosts in the local subnet, try to clear out the ARP cache and let it repopulate. This can be done by using the Arp utility on the command line interface (CLI). First display the cache entries with the arp -a or arp -g commands. Delete the entries with arp -d <IP address>.
  4. Ping the IP address of the default gateway. This is important if the error revolves around the inability to access hosts and resources in other networks, such as networks at remote sites or on the Internet. The default gateway on a local area network (LAN) is the network host to which all data traffic bound for other subnets is sent. If network hosts do not know the IP address of the default gateway, they cannot send data traffic to remote networks. Therefore the default gateway serves as an access point to other networks. Similarly it acts as the entry point into the network for remote hosts sending data to local hosts.
  5. Ping the IP address of a host in another subnet. If you can reach the default gateway, this step will verify that there is proper routing between your subnet and the destination subnet.
  6. Ping the host name of a host in another subnet. This step will help ensure proper DNS resolution of host names of remote hosts. If you cannot successfully ping the remote host name after successfully pinging the remote host’s IP address, the problem is with DNS and not with network connectivity itself.
  7. Run a Tracert (traceroute in Linux/Unix) and PathPing analysis to a remote host to verify that the routers between you and the destination host are operating correctly. Traceroute/tracert is a network diagnostic tool for displaying the route (path) and measuring transit delays of packets across an IP network. PathPing combines the features of the ping and tracert commands. It sends packets to each router (hop) over a period of time and computes results for each hop. In this way PathPing can show if there are any problematic routers between you and the destination.
  8. Other troubleshooting steps you might try include:
    • Issue the ipconfig /release and ipconfig /renew commands.
    • Issue the ipconfig /flushdns and ipconfig /registerdns commands.
    • Right-click the network icon in the system tray and choose ‘Diagnose and repair’ (Vista) or ‘Troubleshoot problems’ (Windows 7). In Windows XP right-click the network connection and choose ‘Repair’.
    • Reboot your computer. Power cycle your switch, router, and cable/DSL modem (unplug them and plug them back in). Does the router firmware need to be updated? It should be at the latest release. Are all cables plugged in properly at every point along the path, from the network adapter on the computer through each intermediate device to the wall jack?
    • Is the computer joined to the proper workgroup or domain? In a workgroup all computers are peers and each computer has its own set of user accounts. To log on to any computer in the workgroup, you must have an account on that computer. All computers must be in the same subnet and use the same workgroup name.
      A domain is a group of accounts and network resources that share a common directory database and set of security policies. One or more computers are servers. Network administrators use servers to control the security and permissions for all computers on the domain. Domain users must provide a password or other credentials each time they access the domain. If you have a user account on the domain, you can log on to any computer on the domain without needing an account on that particular computer. The computers can be on different local networks.
    • Take a peek at the Windows Event Logs using the Event Viewer (Control Panel -> Administrative Tools -> Event Viewer, or Start, type eventvwr, hit Enter). In Linux take a look at /var/log/messages (or /var/log/syslog on Debian-based distributions) using the tail or less commands. Graphical tools for reading event logs in Linux will vary based on your distribution and desktop environment (such as Gnome or KDE).
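
The ping sequence in steps 1 through 6 can be read as a ladder: the first rung that fails points to the most likely cause. Here is a rough Python sketch of that idea, not a finished tool. The names LADDER, is_apipa, ping, resolves, and first_failure are all made up for this example, and the hints are condensed from the steps above. The ping helper simply runs the system ping command and trusts its exit status, which is an assumption that does not always hold (Windows ping can exit with 0 after a "Destination host unreachable" reply).

```python
import ipaddress
import platform
import socket
import subprocess

# Ordered rungs (hypothetical labels) and what a failure at each one suggests.
LADDER = [
    ("loopback", "TCP/IP stack on this host is not working"),
    ("own_ip", "local IP address is missing or misconfigured"),
    ("local_host", "cable, switch port, NIC, subnet mask, or ARP cache"),
    ("gateway", "default gateway is wrong or unreachable"),
    ("remote_ip", "routing between your subnet and the remote subnet"),
    ("remote_name", "DNS resolution, not connectivity itself"),
]


def is_apipa(address: str) -> bool:
    """True when an IPv4 address falls in the link-local block 169.254.0.0/16."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and ip.is_link_local


def ping(host: str, count: int = 2) -> bool:
    """Run the system ping command; True if it exits with status 0."""
    # Windows uses -n for the packet count, Linux/Unix uses -c.
    flag = "-n" if platform.system() == "Windows" else "-c"
    result = subprocess.run(
        ["ping", flag, str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def resolves(name: str) -> bool:
    """True when the host name can be resolved to at least one address."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def first_failure(results: dict) -> str:
    """Given {label: bool}, return a hint for the first rung that failed.

    A rung that is missing from the results is treated as failed.
    """
    for label, hint in LADDER:
        if not results.get(label, False):
            return f"{label}: {hint}"
    return "all checks passed"
```

To use it, fill a dictionary with one result per rung, for example the outcome of ping("127.0.0.1") under "loopback" and the outcome of ping() against your own address, a neighbor, the gateway, and a remote host under the other labels, then pass it to first_failure(). Calling resolves() on the remote host name helps separate a DNS failure from a ping that is merely being blocked. Keeping the decision logic separate from the actual pinging means the results can also be entered by hand from commands run at the prompt.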
For some other in-depth articles dealing with network connectivity troubleshooting, you may want to read Mitch Tulloch’s TCP/IP Troubleshooting: A Structured Approach and Brien Posey’s Troubleshooting Connectivity Problems on Windows Networks.
Go to Part 2 of this series.
