Hide your robots.txt from visitors and show it only to validated robots

Discussion in 'Programming and Scripts' started by Bagi Zoltán, Nov 22, 2007.

  1. Bagi Zoltán

    Bagi Zoltán Guest

    After some hours of searching and hacking, I have finally found everything needed to build a solution that hides the content of your robots.txt file from visitors and displays it ONLY to validated user agents such as Googlebot, Yahoo Slurp and msnbot.

    You may find the whole thing very strange: why would somebody hide that content? My answer is the following:
    That content (the folder structure of the core script files) is private information, and I don't want to share it with every script kiddie and make it possible for them to hurt my site.

    How do you apply this hack? I will guide you through it.

    1. As the first step, add these lines to your .htaccess file; if you don't have one, create it and upload it to the root folder of your domain.


    Code:
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} !(googlebot|Msnbot|Slurp) [NC]
    RewriteRule ^robots\.txt$ http://seo.i-connector.com/  [R,NE,L]
    AddHandler application/x-httpd-php .txt
    I don't think I need to explain the first row. The second and third say that if you are not one of the three big search engines and you request the robots.txt file, you will be redirected to the main domain. This is very handy, since a lot of people set their homepage as the landing page for 404 errors, so the cloaking won't be recognised. (I will talk about the cloaking a bit later as well.)
    The fourth row makes your robots.txt file behave as a PHP script (note that it applies to every .txt file in the folder, not just robots.txt).
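
    For readers more comfortable with PHP than with mod_rewrite, the RewriteCond test amounts to roughly the check below. This is only an illustrative sketch: the function name `ua_claims_crawler` is hypothetical and the user-agent strings are made-up examples.

```php
<?php
// Rough PHP equivalent of the RewriteCond test (hypothetical function name).
// Like the [NC] flag, the /i modifier makes the match case-insensitive.
function ua_claims_crawler($userAgent)
{
    return preg_match('/(googlebot|msnbot|slurp)/i', $userAgent) === 1;
}

// Example user-agent strings, made up for illustration:
var_dump(ua_claims_crawler('Mozilla/5.0 (compatible; Googlebot/2.1)')); // bool(true)
var_dump(ua_claims_crawler('Mozilla/5.0 (Windows NT 10.0) Firefox'));   // bool(false)
```

    The User-Agent header is chosen by the client, so this test alone only filters out visitors who do not bother to fake it; that is why step 2 adds a DNS check.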

    Now you are done with the first step; let's see what else you need to do.

    2. Open a text editor or your favourite web editor, insert the code below into a new file, save it as reversedns.php and upload it to your root folder.

    PHP:
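    <?php
    // Only requests that claim to be one of the three crawlers are checked here
    // (same substrings as the .htaccess rule, which redirects everyone else).
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (stristr($ua, 'msnbot') || stristr($ua, 'Googlebot') || stristr($ua, 'Slurp')) {
        $ip = $_SERVER['REMOTE_ADDR'];
        // Reverse lookup: IP address -> hostname.
        $hostname = gethostbyaddr($ip);
        if (!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname) && !preg_match("/crawl\.yahoo\.net$/", $hostname)) {
            // Hostname is not in a known crawler domain: redirect to the homepage.
            $block = TRUE;
            $URL = "/";
            header("Location: $URL");
            exit;
        } else {
            // Forward lookup: the hostname must resolve back to the same IP.
            $real_ip = gethostbyname($hostname);
            if ($ip != $real_ip) {
                $block = TRUE;
                $URL = "/";
                header("Location: $URL");
                exit;
            } else {
                $block = FALSE;
            }
        }
    }
    ?>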
    This script may be familiar to many of you. It is a hacked version of the reversedns.php file which was presented some months ago. With this hack, if the robot cannot be validated, the script redirects it to your main domain. So let me return for a minute to the cloaking-or-not-cloaking issue. I had to recognise that Google is not capable of protecting my rankings from exploits, so I have to defend myself; hence I believe this is not bad cloaking, only a protection measure. If somebody masks him/herself as Googlebot, he/she will fail this robot validation and will be redirected to the main domain via PHP. No way to recognise the cloaking!
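
    The same validation (reverse lookup, domain check, forward lookup) can also be wrapped in a function, which makes it possible to try it out without real DNS queries. This is a sketch under assumptions, not the original reversedns.php: the function name `is_verified_crawler` and the two resolver parameters are hypothetical, and the hostname pattern simply combines the three patterns from the script above.

```php
<?php
// Sketch of the reverse-then-forward DNS check. The function name and the
// injectable resolver parameters are illustrative assumptions.
function is_verified_crawler($ip, $reverse = 'gethostbyaddr', $forward = 'gethostbynamel')
{
    // Step 1: reverse lookup (IP -> hostname).
    $hostname = call_user_func($reverse, $ip);
    // gethostbyaddr() returns the unchanged IP on failure, false on bad input.
    if ($hostname === false || $hostname === $ip) {
        return false;
    }
    // Step 2: the hostname must end in one of the crawler domains used above.
    if (!preg_match('/(\.googlebot\.com|search\.live\.com|crawl\.yahoo\.net)$/i', $hostname)) {
        return false;
    }
    // Step 3: forward lookup (hostname -> IPv4 list) must lead back to the same IP.
    $ips = call_user_func($forward, $hostname);
    return is_array($ips) && in_array($ip, $ips, true);
}

// Usage with stub resolvers (documentation-range IP, made-up hostname):
$rev = function ($ip) { return 'crawl-1.googlebot.com'; };
$fwd = function ($host) { return array('192.0.2.10'); };
var_dump(is_verified_crawler('192.0.2.10', $rev, $fwd)); // bool(true)
```

    With the default arguments the function performs real lookups through gethostbyaddr() and gethostbynamel(); returning a boolean instead of redirecting leaves the decision about what to do with a failed check to the caller.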

    3. And as the last step:
    Open the robots.txt file you would like to protect and insert the code below as its first line.
    PHP:
    <?php include("reversedns.php"); ?>
    You are done, and your robots.txt file is safe!

    Thanks!
     
    Last edited: Aug 28, 2008
  3. Midlandi

    Midlandi Affiliate affiliate

    Nice work Bagi....:D
     
  4. Bagi Zoltán

    Bagi Zoltán Guest

    Thank you Midi, it took my afternoon. :)
     
  5. temi

    temi Affiliate affiliate

    This is brilliant Bagi, thanks for sharing, I know what you created this for originally :)

    Fellow UK WW members, please digg this post :)
     
  6. pow-wow

    pow-wow Affiliate affiliate

    Nice post! this is great
     
  7. gkd_uk

    gkd_uk Well-Known Member affiliate

    Great post - Dugg and rep added
     
  8. temi

    temi Affiliate affiliate

    Biodun, did you digg the article?
     
  9. SkinnerW

    SkinnerW Guest

    Bagi,

    Do you mind if we refer to your article in the UKWW blog?

    Digged, stumbled and rep added
     
  10. Bagi Zoltán

    Bagi Zoltán Guest

    No, Skinner, that is absolutely no problem. :) Thanks for the digg, the rep and the stumble :)
     
  11. Azar

    Azar Affiliate affiliate

    Hi Bagi!! Awesome post. This is really helpful!

    Can you add a few more search bots (host addresses) to this line?

    if(!preg_match("/\.googlebot\.com$/", $hostname) &&!preg_match("/search\.live\.com$/", $hostname) &&!preg_match("/crawl\.yahoo\.net$/", $hostname)) {
    }

    I would like to check for 'aolbuild|baidu|bingbot|bingpreview|duckduckgo|adsbot-google|mediapartners-google|teoma|yandex'. Can you give me the match strings for these bots so that I can add them to the 'if' condition in your code? (A !preg_match for each bot; I can't find the strings anywhere.)

    Please advise.
     
  12. Graybeard

    Graybeard Well-Known Member affiliate

    How to Use an "if X or X" in a preg_match Statement

    PHP:
    if(!preg_match("/(bot1|bot2|bot3)/", $hostname)) {....}
    This thread is 10+ years old :D
    It's also the wrong approach: to secure something you always want to match what is allowed, NOT what is not allowed. There is no 'not a bot' list, lol; there are too many of them, and then there are fake bots that come from the wrong IP (really the wrong AS block).



    | means "or": "/( | | | )/" is the set of alternatives that could match.
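
    As a sketch of that alternation syntax used as an allow-list: the helper below accepts a hostname only if it ends in one of the listed suffixes. The helper name `hostname_is_allowed` is hypothetical, the test hostnames are made up, and the suffix list is just the three domains from the first post (with a leading dot added); suffixes for other crawlers would have to come from each search engine's own documentation.

```php
<?php
// Hypothetical helper: true only when $hostname ends in an allowed suffix.
function hostname_is_allowed($hostname, array $suffixes)
{
    // An empty list would produce a pattern that matches everything.
    if (count($suffixes) === 0) {
        return false;
    }
    // Escape each suffix so dots are literal, then join with | (regex "or").
    $quoted = array_map(function ($s) { return preg_quote($s, '/'); }, $suffixes);
    $pattern = '/(' . implode('|', $quoted) . ')$/i';
    return preg_match($pattern, $hostname) === 1;
}

$allowed = array('.googlebot.com', '.search.live.com', '.crawl.yahoo.net');
var_dump(hostname_is_allowed('crawl-1.googlebot.com', $allowed));     // bool(true)
var_dump(hostname_is_allowed('googlebot.com.example.org', $allowed)); // bool(false)
```

    The trailing `$` anchor matters: without it, a hostname that merely contains an allowed domain somewhere in the middle would also pass.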

    Really, I think it's a waste of time and resource intensive: reverse DNS is slow, and Googlebot, for example, may request the robots.txt 2 or 3 times in a row. Then if the script messes up, or Google comes in with a cloaked User-Agent (they do that, BTW), you could mess up your indexing and create a mess that is hard to recover from.

    This is silly. Hostile User-Agents do not even request the robots.txt, the same way burglars do not ring the doorbell or knock first; they just want to sneak in, of course.
     
    Last edited: Jun 29, 2018
  13. Azar

    Azar Affiliate affiliate

    Thanks for the warning. I will stop implementing it, then.
     