Coding, Tech and Developers Blog
Blazor has been around for some years now as an emerging web application platform for .NET developers. If, at some point, you decide to use it for a real-world project, there are a few things you will have to solve in a .NET way. Today, we will look at how you can prevent search engine crawlers from inspecting all of your pages.
I have followed the development and evolution of Blazor closely over the past few years because I consider it a great and powerful alternative to the many PHP frameworks out there. As of now, this website is built with Statamic, and I am happy with it. But, to be honest, I would be much more confident implementing and changing features if it were a web application built on top of .NET. To me, Blazor continuously proves to be a valid alternative, and I have used it successfully in a few projects.
In this article, we will have a look at a real-world problem when it comes to publicly accessible sites with Blazor. Well, not a problem actually, but something we will have to do in a .NET/Blazor way, instead of HTML: Robots.txt and robot meta tags.
All of our known search engines (remember Lycos? Still out there!) use a software entity called a crawler to continuously go through all of the internet's web pages and scan them for new content. Consequently, every search performed on any search engine only acts upon an indexed snapshot of the web; it can never be truly up to date. There will obviously be occasions when you'd like to instruct those crawlers not to inspect a specific portion of your web application. For example, who would like their WordPress admin pages to appear in the search results? Likewise, you'd probably prefer your disclaimer, and consequently your address, not to appear in the top five search results on any search engine.
There are basically two different ways in which we can guide crawlers on how to inspect our web pages and applications. One is a file called robots.txt, and the other is meta tags that are placed directly in the head section of one or more pages on our site.
robots.txt is a file that is usually served via www.yourdomain.com/robots.txt and is considered valid for your whole site. With it, you can instruct crawlers to avoid crawling unimportant content on your pages. It is a means to prevent overwhelming crawler traffic on your site.
It is, however, not a way to explicitly prevent crawlers from inspecting certain areas. Pages, even if disallowed in the robots.txt, can still appear in search results (they just won't have any meaningful description text).
Also, how the robots.txt is interpreted is up to the crawler, i.e., up to the search engine. There is no guarantee that any crawler truly obeys the instructions in this file.
Using meta tags such as noindex is a way to instruct crawlers to explicitly remove pages from search results. These tags can be specified on a page-by-page basis and thus allow more fine-grained control over your content.
There are also other tags, such as noarchive, which are often combined with each other:
<meta name="robots" content="noindex, nofollow, noarchive" />
Again, it is not guaranteed that all search engine crawlers support or obey these instructions. If you really want to hide your content, use password protection.
Also, as a general heads-up: using robots.txt and meta tags in combination can prove self-defeating. A noindex tag on a page that is disallowed via robots.txt is rendered useless, because any respectable crawler will never reach this statement at all and might therefore still include the page in its search results.
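To make this pitfall concrete, consider a hypothetical /legal page (the path is purely for illustration) that carries a noindex tag while robots.txt also disallows it:

```text
# robots.txt — tells crawlers not to fetch /legal at all
User-agent: *
Disallow: /legal

<!-- head of /legal — a compliant crawler never downloads this page,
     so it never sees the following instruction: -->
<meta name="robots" content="noindex" />
```

A crawler honoring the Disallow rule never reads the noindex tag and may still list the bare URL in its results; removing the Disallow line lets the noindex take effect.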
robots.txt in Blazor
The easiest way to serve a robots.txt file in your Blazor application is to use a middleware. We will need:
- a robots.txt file with the desired content, placed in the wwwroot folder of our Blazor project,
- a RobotsMiddleware class serving exactly this file upon request,
- Program.cs registering this middleware in our project.
First, we place the actual robots.txt file in the wwwroot folder. For starters, the content can be as simple as this (a robots.txt containing no rules allows everything):
# An example robots.txt allowing everything
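In practice you will usually want to exclude something. As a sketch, a robots.txt that blocks a hypothetical /admin area (the path is an assumption for illustration) could look like this:

```text
# Block all crawlers from the admin area, allow everything else
User-agent: *
Disallow: /admin/
```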
Second, we implement our middleware class like this:
public class RobotsMiddleware
{
    private const string RobotsFileName = "robots.txt";
    private static readonly string RobotsFilePath = Path.Combine(Directory.GetCurrentDirectory(), "wwwroot", RobotsFileName);
    private readonly RequestDelegate next;

    public RobotsMiddleware(RequestDelegate next) => this.next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        if (context.Request.Path.StartsWithSegments("/" + RobotsFileName))
        {
            var output = await File.ReadAllTextAsync(RobotsFilePath);
            context.Response.ContentType = "text/plain";
            await context.Response.WriteAsync(output);
            return;
        }
        await next(context);
    }
}
This is a standard middleware implementation that is hit on every request made to our web application. If the request path matches the robots.txt file name, it reads our file, writes it to the response, and terminates the pipeline. Otherwise, the next element in the pipeline is called.
Now, this implementation is not terribly efficient when your application is hit by many crawlers in a short amount of time, because it performs file operations for every matching request. As homework, you might want to have a look at IMemoryCache to improve it ;-).
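As a starting point for that homework, here is one possible sketch (assuming IMemoryCache has been registered via builder.Services.AddMemoryCache(); the cache key and expiration are arbitrary choices):

```csharp
public class CachedRobotsMiddleware
{
    private readonly RequestDelegate next;
    private readonly IMemoryCache cache;

    public CachedRobotsMiddleware(RequestDelegate next, IMemoryCache cache)
    {
        this.next = next;
        this.cache = cache;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        if (context.Request.Path.StartsWithSegments("/robots.txt"))
        {
            // Read the file at most once per hour; serve the cached copy otherwise.
            var output = await cache.GetOrCreateAsync("robots.txt", entry =>
            {
                entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(1);
                return File.ReadAllTextAsync(
                    Path.Combine(Directory.GetCurrentDirectory(), "wwwroot", "robots.txt"));
            });
            context.Response.ContentType = "text/plain";
            await context.Response.WriteAsync(output ?? string.Empty);
            return;
        }
        await next(context);
    }
}
```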
Lastly, we need to add a single line to our Program.cs to actually register the middleware:
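Assuming the minimal hosting model of recent .NET versions, the registration goes between builder.Build() and app.Run():

```csharp
var builder = WebApplication.CreateBuilder(args);
// ... service registrations ...
var app = builder.Build();

// Serve robots.txt through our custom middleware.
app.UseMiddleware<RobotsMiddleware>();

// ... the rest of the request pipeline ...
app.Run();
```

The single relevant line is the app.UseMiddleware<RobotsMiddleware>() call; everything else is the usual template code.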
And that is it. If you now run the application and navigate to the corresponding path, you will be served the content of the file in your project. Updating its content will also update what crawlers (and users) see, just as it would on classic web servers.
Please note that with this approach you could also serve other static or dynamic files commonly found on websites, such as a sitemap or feeds. The middleware approach lets you fully customize those responses in a language you are familiar with.
<head> content in Blazor
To gain more fine-grained control over the behavior of crawlers on your pages, we can also add specific tags to individual pages. To do this, Blazor has the concept of so-called outlets; one is prebuilt for the head of your pages. By opening the _Host.cshtml of any of your recent Blazor projects, you will notice a line like this:
<component type="typeof(HeadOutlet)" render-mode="ServerPrerendered" />
This is where Blazor will render page-specific header content for you if it is defined.
To do so, you can place the Blazor tag HeadContent on any of your pages:

<HeadContent>
    <meta name="robots" content="noindex, nofollow, noarchive" />
</HeadContent>
After running the application and navigating to this page, you should be able to find this tag inside the head content of the page. If you want to find out more about the specific robots tags, head over to the Google Developers guide.
Now, one might be tempted to put this snippet inside a Blazor component and reuse it, and maybe put other tags in different components to reuse those as well. But be careful: Blazor will only render the latest set of HeadContent and will not magically merge all of your calls together.
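To illustrate (the component names here are made up): if a reusable component and the page itself both render HeadContent, only the set rendered last ends up in the head.

```razor
@* SeoDefaults.razor — a hypothetical shared component *@
<HeadContent>
    <meta name="robots" content="noindex" />
</HeadContent>

@* SomePage.razor *@
<SeoDefaults />
<HeadContent>
    @* This set wins; the noindex from SeoDefaults is not merged in. *@
    <meta name="robots" content="noarchive" />
</HeadContent>
```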
I hope this article gave you a short overview of how to serve a robots.txt and implement page-specific tags in your Blazor application to point search engine crawlers in the right direction.
As the coding part in this article is rather short, I do not provide a GitHub repository this time.