I just read on Google’s official Webmaster blog that they’ve started experimenting with more advanced crawling to help them index pages that are inaccessible via links. Their crawler is actually filling in and submitting forms on the site and checking the results.
From the blog:
Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. [...] our crawl agent always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won’t crawl any of the URLs that a form would generate. Similarly, we only retrieve GET forms and avoid forms that require any kind of user information.
The idea definitely sounds interesting, but I can’t help wondering whether it might have unintended consequences. Why?
It all boils down to so many programmers failing to understand the differences between GET and POST and when to use one instead of the other. To many, the distinction is simply “GET is easy to test, ’cause you can try the URL in the browser; POST hides the parameters”. This is not the whole picture, though.
GET is intended to be used to, well, get data. A search form is a prime example: you enter a word and the server gives you back a list of results.
POST is for actions. Login is a good example, but basically any action that actually alters anything should be handled via a POST, although very often it is not.
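To make the distinction concrete, here is a minimal sketch of a request handler that answers searches over any method but refuses to delete anything unless the request is a POST. Everything in it (the `ITEMS` store, the `handle` function, the action names) is made up for illustration and is not taken from any particular framework.

```python
from http import HTTPStatus

# Hypothetical in-memory data store, for illustration only.
ITEMS = {1131: "old draft", 1132: "published post"}


def handle(method, action, params):
    """Return a (status, body) pair for a request. All names are illustrative."""
    if action == "search":
        # Reading data has no side effects, so GET is fine here.
        word = params.get("q", "")
        return HTTPStatus.OK, [text for text in ITEMS.values() if word in text]

    if action == "delete":
        # Deleting alters data: refuse anything that is not a POST.
        if method != "POST":
            return HTTPStatus.METHOD_NOT_ALLOWED, "delete changes data: use POST"
        removed = ITEMS.pop(int(params["itemid"]), None)
        if removed is None:
            return HTTPStatus.NOT_FOUND, "no such item"
        return HTTPStatus.OK, "deleted"

    return HTTPStatus.NOT_FOUND, "unknown action"
```

With a check like this, a plain link or a crawled URL can still run a search, but it cannot trigger the delete.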
I’ve seen plenty of cases where actions are sent via GET, with query strings like ?action=delete&itemid=1131. Can you see the potential issue? Now, granted, Googlebot will only crawl through public areas (it won’t touch password fields and such), so the damage should, in theory, be limited. In practice, the bot will try combinations of a form’s widget values, like an automated tester but without the benefit of running in a supervised environment.
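To see how such a URL could come out of a form, here is a rough sketch of building GET URLs from the values a form offers. This is my own guess at the general idea, not Google’s actual algorithm: the `candidate_urls` function, the example.com address and the field values are all hypothetical, and the sketch enumerates every combination for simplicity, while the blog post only talks about a small number of queries.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(base_url, fields):
    """Yield one GET URL per combination of the form's offered values.

    `fields` maps each input name to the list of values found in the HTML
    (select options, radio buttons, check boxes). Hypothetical helper.
    """
    names = list(fields)
    for combo in product(*(fields[name] for name in names)):
        yield base_url + "?" + urlencode(dict(zip(names, combo)))


# A made-up form with an "action" select and an "itemid" select.
form_fields = {"action": ["view", "delete"], "itemid": ["1131", "1132"]}
urls = list(candidate_urls("http://example.com/items", form_fields))
```

Two selects with two values each already give four URLs, and two of them carry action=delete. If the server acts on those when they arrive via GET, merely fetching them is enough to remove the items.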
My advice? Have a look at your code and remember “if something can go wrong, it will”. If you see something not quite OK, better fix it now.