List diving and web scraping have definitely become an art form in recent years. I’m lucky enough to chat with Aaron Lintz once in a while and absorb some advanced techniques for unlocking the secrets of the Matrix. Usually, the stuff is too advanced for me the first time he explains it.
One day, while looking for directories and attendee lists, we got to talking about data extraction and some specific syntax. He pulled up this robots.txt trick I’d never seen before. When Lintz pulled up the robots file and started indexing entire websites, I thought the sentinels would seek and destroy us to disallow access. However, this is publicly shared information, so anyone can view the robots.txt files for company websites (without fear of vengeful gigantic robots). The outcome has been pretty eye-opening, particularly on the medical side of things.
Enter the Matrix of Robots.txt
So what the heck is robots.txt? It’s a plain text file that many organizations place at the root of their websites to ask automated bots, search engines, and web crawlers to stay away from specific URLs (the “disallow” lines). These rules are requests to well-behaved crawlers, not access controls, so the listed pages are often still reachable in a browser. By reading the robots.txt file, you are looking for clues that lead to more information, including the pages behind a “disallow.” Typically, you can find a link to a public sitemap (which is often an index of the entire website). Sometimes you can stumble upon something bigger. I’ve had some success (and a ton of fun) while looking for indexing trends or trying promising URL variations.
Keep in mind this type of method requires much trial and error and is more appropriately used when you’ve exhausted most of the normal outlets. This is truly meant for the deep web dives.
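If you would rather not eyeball the file, here is a minimal Python sketch of pulling the two kinds of clues out of a robots.txt file: the Disallow paths and the Sitemap links. The function name parse_robots and the sample text are made up for illustration (the sample is not any real site's file), and the parsing is deliberately simple rather than a complete implementation of the robots.txt rules.

```python
def parse_robots(text):
    """Collect Disallow paths and Sitemap URLs from robots.txt content."""
    disallowed, sitemaps = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        # Split on the first colon only, so URLs keep their "https:" part.
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if not value:
            continue  # an empty Disallow means "nothing is blocked"
        if field == "disallow":
            disallowed.append(value)
        elif field == "sitemap":
            sitemaps.append(value)
    return disallowed, sitemaps


# Made-up sample content, for illustration only.
SAMPLE = """\
# robots.txt (hypothetical sample)
User-agent: *
Disallow: /doctors.htm
Disallow: /search/
Sitemap: https://www.example.org/sitemap.xml
"""

disallowed, sitemaps = parse_robots(SAMPLE)
print(disallowed)
print(sitemaps)
```

To try it on a real site, you would paste the contents of that site's robots.txt into the function in place of SAMPLE.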
Following the White Rabbit
Start with the main company website. This typically works best with larger organizations, but anything is worth a quick look:
Let’s index a hospital (based in NY).
Go to the main website: www.mskcc.org
Add /robots.txt to the tail end of the URL: https://www.mskcc.org/robots.txt
This pulls up a pretty expansive list of programming jargon (with a nice explanation of the robots file). Ignore this for now; we are looking for the sitemap, which is a table of contents for the website (and is usually at the bottom of the text file).
The highlighted link is the sitemap. Plug this into your browser: https://www.mskcc.org/sitemap.xml
This sitemap leads to 31 other sitemaps. The bio clinical URL looks promising, as does the doctor file. Trial and error led us to this full list of clinical profiles:
https://www.mskcc.org/sitemap.xml?entity_type=node&type=bio_clinical
Click directly on the link and it sends us straight to a Nurse Practitioner.
Now you can use some of the cross-referencing techniques here to dig in further.
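The step of scanning a sitemap that points to other sitemaps can be sketched in Python as well. This assumes the file follows the standard sitemaps.org index format (a sitemapindex element containing sitemap and loc entries), which may not hold for every site. The sample XML below uses a made-up example.org domain; only the entity_type and type query pattern is borrowed from the walkthrough above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def child_sitemaps(xml_text):
    """Return the <loc> URL of every child sitemap in a sitemap index."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", NS)
        if loc.text
    ]


# Hypothetical sitemap index, for illustration only.
SAMPLE_INDEX = """\
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.org/sitemap.xml?entity_type=node&amp;type=bio_clinical</loc></sitemap>
  <sitemap><loc>https://www.example.org/sitemap.xml?entity_type=node&amp;type=news</loc></sitemap>
</sitemapindex>
"""

urls = child_sitemaps(SAMPLE_INDEX)
# Keep only the child sitemaps whose URL hints at clinical bios.
promising = [u for u in urls if "bio_clinical" in u]
print(promising)
```

Swapping "bio_clinical" for another fragment (say, "doctor") is the scripted version of the trial and error described above.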
Learning Kung Fu and Trendsetting
Again, not all these searches are this clean and easy. Dead ends are common when messing around with this stuff, but you can identify certain trends when exploring sitemaps from similar organizations.
Sometimes you can try a certain URL segment and pull similar results.
For example, put this into your web browser: http://memorialhermann.org/robots.txt
Notice the following line: Disallow: /doctors.htm
Like in the Matrix, you can bend the rules a bit. Bypass the “disallow” and substitute /doctors.htm in the URL instead of /robots.txt:
http://www.memorialhermann.org/doctors.htm
We just found a Zion of MDs. All those glorious links with the doctors’ names and listed departments, just waiting to be sourced!
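The swap of /robots.txt for a disallowed path can be sketched with Python's standard urljoin. The helper name candidate_urls is made up, and the wildcard entry in the example list is a hypothetical pattern added to show why such lines get skipped; only /doctors.htm comes from the file above.

```python
from urllib.parse import urljoin


def candidate_urls(robots_url, disallow_paths):
    """Turn literal Disallow paths into full URLs worth trying in a browser."""
    urls = []
    for path in disallow_paths:
        if "*" in path or "$" in path:
            continue  # wildcard patterns are not literal pages
        # A path starting with "/" replaces "/robots.txt" on the same host.
        urls.append(urljoin(robots_url, path))
    return urls


urls = candidate_urls(
    "http://www.memorialhermann.org/robots.txt",
    ["/doctors.htm", "/*.pdf$"],  # second entry is a hypothetical pattern
)
print(urls)
```

This only builds the addresses; whether a given page actually loads is still something you find out by visiting it.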
Speak URL, not Boolean, Neo
We can take it a step further from here. Say we need MDs based in emergency medicine. We search the matrix for “emergency medicine” and find a short list of potential candidates. Notice you’ll need to use a hyphen (-) between words in this search since we are speaking URL, not Boolean.
Press Ctrl-F (to pull up your browser’s find box) and search: emergency-medicine
We found our candidate, and it looks like she has experience with pediatrics as well.
This can be much easier than accessing a company’s contact finder. You can see the full list of results and change the keywords quickly instead of working within a profile finder that may be more limiting or that may have some restrictions on total access.
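If you have copied the links off the page into a list, the same keyword check can be scripted. This is a rough Python sketch: the helper names to_slug and filter_by_keyword and the example.org URLs are made up for illustration, and it assumes the site separates words in its URLs with hyphens, as in the search above.

```python
import re


def to_slug(phrase):
    """Convert a plain phrase into hyphenated "URL speak"."""
    return re.sub(r"[^a-z0-9]+", "-", phrase.lower()).strip("-")


def filter_by_keyword(urls, phrase):
    """Keep only the URLs that contain the hyphenated form of the phrase."""
    slug = to_slug(phrase)
    return [u for u in urls if slug in u.lower()]


# Hypothetical profile links, for illustration only.
SAMPLE_URLS = [
    "http://www.example.org/doctors/jane-doe-emergency-medicine.htm",
    "http://www.example.org/doctors/john-roe-cardiology.htm",
]

matches = filter_by_keyword(SAMPLE_URLS, "Emergency Medicine")
print(to_slug("Emergency Medicine"))
print(matches)
```

Changing the keyword is then a one-word edit, which is the same advantage over a profile finder described above.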
Now you know a little URL kung fu. As you can see, it can land a big score if you look at a website through another doorway. I hope this helps. Happy hunting!