Making Sense of Troubleshooting and Preventive Medicine…

If you’ve ever had to troubleshoot an issue on the third iteration of the SharePoint platform, then you know that resolving problems sometimes requires tinkering with more than just what you’ll find in Central Admin.

I’ve dealt with everything from timer jobs not firing because daylight saving time patches hadn’t been applied, to workflows misbehaving because network latency kept message traffic from arriving when it was supposed to, to the joys of sAMAccountNames being modified after a user accessed a site, to the glories of psconfig failing to provision and deprovision web applications properly during an upgrade and leaving a cloud of dust within the ULS logs.

I’m not here to tell war stories, but rather to offer a few ideas and suggestions for when you’re troubleshooting a problem.

1 – Document everything – How is this troubleshooting?  It’s not really; it’s more the preventive medicine for when you’re going to have to troubleshoot… consider it a part of the Boy Scout Motto "Be Prepared".  Knowing your interfaces to other systems, your taxonomies (security, site and features), and your architecture (both physical and logical) will save you hours and hours of time when you’re attempting to troubleshoot an issue.  Otherwise, troubleshooting becomes a blind analysis, feeling along the walls hoping to find the issue.  I’d recommend keeping a OneNote journal with configuration settings and changes for your systems so that the information lives in a single source (or if you want to use Google Sites, Notebook or Docs, that’s cool too :)).

2 – Know your AD environment – do you have custom domain security policies that are being applied to a specific organizational unit?  Did someone inadvertently move your server to the wrong place in the OU structure while performing directory maintenance, so that now, regardless of what you do to reconfigure your server, the domain policy continues to lock it down?  Knowing your AD environment and providing relevant data to your domain administrator will at least let you confirm or rule out a cause that sits outside your immediate control.

3 – Plan your system appropriately – this goes back to #1.  If you aren’t planning things out in a technical sense and haven’t put forth a plan for how you’re going to implement a system, troubleshooting is going to take a while, so get a Snickers bar.  I’d recommend starting with the planning worksheets as defined in the SharePoint 2007 Deployment Guide and Checklists – better yet, build a project plan so that you can be sure you’ve thought through everything.  If you’ve got your system planned appropriately and you have your documentation handy, showing how you configured Kerberos and the affiliated SPNs in your domain, troubleshooting should be too easy, right?

4 – Be prepared to hit the logs for troubleshooting.  There are two logs that you should probably be acutely familiar with – the IIS logs for the associated web applications in your SharePoint enclave, as well as the Unified Logging System (ULS) logs for SharePoint.  If you’re familiar with web applications and how to read IIS logs, then you should be fine and not have any issues.  ULS logs for SharePoint on the other hand can be somewhat cryptic in nature.  I would highly recommend using something like the SharePoint Logging Spy from CodePlex to provide insight into what is truly going on within your SharePoint instance.
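If you’d rather take a first pass at a ULS log yourself before reaching for a viewer, here is a minimal Python sketch of the idea.  It assumes the log is tab-delimited with the column order listed in ULS_COLUMNS (check that against the header line of your own log files), and the sample lines, event IDs and messages in it are made up purely for illustration.  The function names are mine, not part of any SharePoint tooling.

```python
from collections import Counter

# Assumed column order for a tab-delimited ULS log file.
# Verify this against the header line of your own logs before relying on it.
ULS_COLUMNS = ["Timestamp", "Process", "TID", "Area", "Category",
               "EventID", "Level", "Message", "Correlation"]

# Made-up sample lines for illustration only; not real SharePoint output.
SAMPLE_LOG = "\n".join([
    "\t".join(ULS_COLUMNS),
    "01/15/2009 10:00:01.12\tOWSTIMER.EXE\t0x0B2C\tExample Area\tTimer\t0000\tMedium\tExample timer message\t",
    "01/15/2009 10:00:02.40\tw3wp.exe\t0x1234\tExample Area\tGeneral\t0000\tHigh\tExample access message\t",
    "01/15/2009 10:00:03.77\tw3wp.exe\t0x1234\tExample Area\tDatabase\t0000\tUnexpected\tExample database message\t",
])


def parse_uls(text):
    """Return a list of dicts, one per log entry, skipping blank and header lines."""
    entries = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if not fields[0] or fields[0] == "Timestamp":
            continue  # blank line or header row
        # Pad short rows so every entry has every column.
        fields += [""] * (len(ULS_COLUMNS) - len(fields))
        entries.append(dict(zip(ULS_COLUMNS, fields)))
    return entries


def filter_by_level(entries, levels=("Critical", "Unexpected", "High")):
    """Keep only the entries whose Level is one of the given severities."""
    wanted = set(levels)
    return [entry for entry in entries if entry["Level"] in wanted]


def count_by_category(entries):
    """Count entries per Category to show where the noise is coming from."""
    return Counter(entry["Category"] for entry in entries)


if __name__ == "__main__":
    noisy = filter_by_level(parse_uls(SAMPLE_LOG))
    for entry in noisy:
        print(entry["Timestamp"], entry["Level"], entry["Category"], entry["Message"])
    print(count_by_category(noisy))
```

To point it at a real file, read the file’s text and pass it to parse_uls; a quick count by category is often enough to tell you which part of the farm deserves a closer look.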

5 – Did you check to make sure your interfaces were still connected?  It’s always embarrassing when you realize after the fact that your data communications problems with SQL Server weren’t caused by a password change or a malicious DoS attack taking down your data sources, but by a loose HBA or Ethernet connection.  As my CCNA instructor mentioned five years ago, start at the bottom of the OSI model and work your way up.
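Once you’ve confirmed the cables are seated, a quick way to keep climbing the stack is to check whether a plain TCP connection to each dependency succeeds.  The Python sketch below does only that; it says nothing about authentication or the health of the service behind the port.  The host name in the usage example is hypothetical, and 1433 is simply the default SQL Server port, so substitute whatever your own documentation says.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map each endpoint name to True/False depending on TCP reachability."""
    return {name: can_connect(host, port, timeout)
            for name, (host, port) in endpoints.items()}


if __name__ == "__main__":
    # Hypothetical host name; replace with the servers from your own documentation.
    endpoints = {"sql": ("sql01.example.local", 1433)}
    for name, reachable in check_endpoints(endpoints).items():
        print(name, "reachable" if reachable else "NOT reachable")
```

A False here doesn’t tell you which layer failed, only that it’s time to look at the layers below the application before you start resetting passwords.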

Are these the only five things you need to know and consider when troubleshooting?  By no means.  I would recommend keeping a few other resources handy as well (e.g. Google, Live Search, TechNet, me) to help diagnose an issue and work toward a solid solution that fixes the problem in the most elegant way possible (and remember to document the fix should the problem ever pop up again).

