Sitemaps and Indexing
What is a sitemap?
A sitemap is a file that lists the pages, videos, and other files on a website and provides information about them, which helps search engines such as Google and Bing index the site correctly.
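To make this concrete, the sketch below builds a minimal XML sitemap in the format defined by the sitemaps.org protocol. The `SitemapEntry` type, the function names, and the example URL are illustrative assumptions, not part of any existing codebase.

```typescript
// Hypothetical shape for one sitemap entry; only `loc` is required by the protocol.
interface SitemapEntry {
  loc: string; // absolute URL of the page
  lastmod?: string; // optional W3C date, e.g. "2024-01-15"
}

// Escape the five characters that are special in XML.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build a <urlset> document with one <url> element per entry.
function buildUrlset(entries: SitemapEntry[]): string {
  const urls = entries.map((entry) => {
    const lastmod = entry.lastmod ? `<lastmod>${escapeXml(entry.lastmod)}</lastmod>` : "";
    return `  <url><loc>${escapeXml(entry.loc)}</loc>${lastmod}</url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}

console.log(buildUrlset([{ loc: "https://example.com/products/1?ref=a&b=2", lastmod: "2024-01-15" }]));
```

Note that `&` in a URL has to be written as `&amp;` inside the XML, which is why the entries are escaped before being embedded.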
What is indexing and what are crawlers?
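Crawlers are automated programs (bots) that a search engine uses to find and retrieve web pages across the internet. Indexing is the step that follows: the search engine analyzes the retrieved pages and stores them in its index so that they can appear in search results.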
How does indexing work?
When a crawler visits a website, it discovers web pages by following links or by using a sitemap provided by the website. It reads each page's HTML and content to understand the page. If the page is worth indexing, based on content quality, SEO, and other factors such as page load speed, the search engine adds it to its index.
Proper indexing is essential for SEO: a page that has been crawled but not indexed is invisible to users searching for relevant information.
Key points for effective indexing:
- Submit a sitemap to search engines.
- Ensure your site is crawlable and that no page you want indexed is blocked by robots.txt.
- Use relevant keywords and good structure so the content can be easily understood by both users and search engines.
- If your pages have images, make sure the images have alt text.
- Set proper metadata on your pages, such as a title and a meta description.
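- Duplicate pages or pages with similar content can be tagged with a "noindex" directive, set in a robots meta tag or an X-Robots-Tag HTTP header, which tells crawlers not to index them. Note that robots.txt controls crawling, not indexing, and a page blocked in robots.txt cannot be crawled, so its "noindex" directive would never be seen.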
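As an illustration of the "noindex" point, here is a small sketch of how a page could emit the directive either as a robots meta tag or as an `X-Robots-Tag` response header. The `PageInfo` type and its `isDuplicate` flag are hypothetical; how duplicates are detected is outside the scope of this sketch.

```typescript
interface PageInfo {
  path: string;
  isDuplicate: boolean; // hypothetical flag set by whatever detects duplicate content
}

// Value shared by the robots meta tag and the X-Robots-Tag header.
function robotsValue(page: PageInfo): string {
  return page.isDuplicate ? "noindex, follow" : "index, follow";
}

// HTML variant: goes inside the page's <head>.
function robotsMetaTag(page: PageInfo): string {
  return `<meta name="robots" content="${robotsValue(page)}">`;
}

// HTTP variant: useful for non-HTML responses such as PDFs.
function robotsHeader(page: PageInfo): [string, string] {
  return ["X-Robots-Tag", robotsValue(page)];
}

console.log(robotsMetaTag({ path: "/products/1?sort=price", isDuplicate: true }));
// <meta name="robots" content="noindex, follow">
```

For the directive to take effect, the page must remain crawlable, so it should not also be disallowed in robots.txt.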
Approach 1: Sitemap Generation in a Next.js Route
Process:
- Create a route for the sitemap in Next.js.
- Create a backend API that fetches the data (products, designers, etc.) from Elasticsearch.
- Call the sitemap generation function in Next.js to create the sitemap dynamically.
- Generate multiple sitemaps: the main sitemap file lists the top-level sitemap URLs, one per main feature (products, designers, etc.).
- Requesting the main sitemap URL generates only the top-level sitemap.
- Requesting a child sitemap URL generates the data for that sitemap only. For example, products-sitemap.xml generates the product URLs.
- The same Next.js page and code handle every sitemap; which sitemap is generated depends on the parameter passed in the URL.
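The dispatch described in this process could look roughly like the sketch below: one function receives the requested file name and returns either the main sitemap (an index of child sitemaps) or a single child sitemap. `SITE_URL`, `SECTIONS`, `PathFetcher`, and the demo paths are assumptions for illustration. The fetcher is kept synchronous for brevity, whereas a real call to the backend API would be asynchronous.

```typescript
const SITE_URL = "https://example.com"; // assumed site origin
const SECTIONS = ["products", "designers"]; // top-level sitemap sections

// Hypothetical data source. In the setup described above this would call the
// backend API that queries Elasticsearch.
type PathFetcher = (section: string) => string[];

// Escaping &, < and > is enough for text placed inside <loc>.
function escapeXml(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function wrapXml(root: "sitemapindex" | "urlset", items: string[]): string {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<${root} xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">`,
    ...items,
    `</${root}>`,
  ].join("\n");
}

// Returns the XML for the requested file, or null when the file is unknown
// (the route would answer 404 in that case).
function renderSitemap(file: string, fetchPaths: PathFetcher): string | null {
  if (file === "sitemap.xml") {
    // Main sitemap: only links to the child sitemaps, no page data is fetched.
    const children = SECTIONS.map(
      (section) => `  <sitemap><loc>${SITE_URL}/${section}-sitemap.xml</loc></sitemap>`,
    );
    return wrapXml("sitemapindex", children);
  }
  const match = /^([a-z]+)-sitemap\.xml$/.exec(file);
  if (match === null || SECTIONS.indexOf(match[1]) === -1) return null;
  // Child sitemap: fetch and list the URLs of that section only.
  const urls = fetchPaths(match[1]).map(
    (path) => `  <url><loc>${escapeXml(SITE_URL + path)}</loc></url>`,
  );
  return wrapXml("urlset", urls);
}

// Stand-in fetcher with made-up paths.
const demoFetcher: PathFetcher = (section) => [`/${section}/1`, `/${section}/2`];
console.log(renderSitemap("sitemap.xml", demoFetcher));
console.log(renderSitemap("products-sitemap.xml", demoFetcher));
```

Keeping the index branch free of data fetching means a request for the main sitemap stays cheap even when the child sitemaps are large.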
Pros:
The sitemap is always up to date because it is generated every time the URL is requested.
Cons:
High processing load on the frontend, because every request regenerates the sitemap and queries the backend; this can affect performance.
Approach 2: Sitemap Generation with Node.js
Process:
Install the sitemap package (https://www.npmjs.com/package/sitemap) in the Node.js backend.
Fetch data from Elasticsearch.
Use the sitemap package to create the sitemaps.
Replace the old sitemaps with the new ones in the Next.js directory.
Next.js serves these static sitemap files.
Set up a cron job to regenerate the sitemaps daily or as needed.
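The replacement step is where the old and new files can briefly overlap, so here is a minimal sketch of one way to swap them safely: write each file under a temporary name and rename it over the old one. It uses only Node.js built-in modules and does not show the sitemap package's API. The function name, the file names, and the target directory are assumptions.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Write each sitemap to a temporary file in the target directory, then rename it
// over the old one. On POSIX systems a rename within one filesystem is atomic,
// so a request never sees a half-written file.
function replaceSitemaps(targetDir: string, files: { [fileName: string]: string }): string[] {
  fs.mkdirSync(targetDir, { recursive: true });
  const written: string[] = [];
  for (const fileName of Object.keys(files)) {
    const finalPath = path.join(targetDir, fileName);
    const tempPath = `${finalPath}.tmp-${process.pid}`;
    fs.writeFileSync(tempPath, files[fileName], "utf8");
    fs.renameSync(tempPath, finalPath);
    written.push(finalPath);
  }
  return written;
}

// Usage against a throwaway directory. A real job would target the directory
// Next.js serves static files from (commonly `public/`, an assumption here).
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "sitemaps-"));
replaceSitemaps(demoDir, {
  "sitemap.xml": "<sitemapindex/>",
  "products-sitemap.xml": "<urlset/>",
});
console.log(fs.readdirSync(demoDir).sort());
```

Whether files replaced at runtime are picked up depends on how Next.js is deployed. The Next.js documentation notes that only files present in `public/` at build time are served, so this is worth verifying for your setup.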
Pros:
Reduces processing load on the frontend.
Regularly updated sitemap with minimal impact on frontend performance.
Cons:
The backend handles the processing load.
Careful handling required when updating the Next.js directory.
Approach 3: Sitemap Generation with S3
Process:
Install the sitemap package on Node.js.
Fetch data from Elasticsearch.
Use the sitemap package to create the sitemaps.
Upload the new sitemaps to an S3 bucket, replacing the old ones.
Next.js fetches the sitemaps from the S3 bucket and serves them.
Set up a cron job to regenerate the sitemaps daily or as needed.
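The frontend side of this approach can stay small. The sketch below shows the idea: validate the requested file name, fetch the matching object from the bucket, and pass the XML through. `SITEMAP_BASE_URL`, `RouteResult`, and `TextFetcher` are assumptions; the fetcher is injected so that the example runs without network access, and in a real route it would wrap `fetch()` or the AWS SDK.

```typescript
// Assumed public (or CDN-fronted) location of the generated files.
const SITEMAP_BASE_URL = "https://example-bucket.s3.amazonaws.com/sitemaps";

interface RouteResult {
  status: number;
  contentType: string;
  body: string;
}

// Minimal stand-in for fetch(), injected so the sketch runs without a network.
type TextFetcher = (url: string) => Promise<{ ok: boolean; text: string }>;

async function serveSitemapFromS3(file: string, fetchText: TextFetcher): Promise<RouteResult> {
  const notFound: RouteResult = { status: 404, contentType: "text/plain", body: "Not found" };
  // Accept only sitemap file names so the route cannot be used to read other objects.
  if (!/^([a-z0-9]+-)?sitemap\.xml$/.test(file)) return notFound;
  const upstream = await fetchText(`${SITEMAP_BASE_URL}/${file}`);
  if (!upstream.ok) return notFound;
  return { status: 200, contentType: "application/xml", body: upstream.text };
}

// Usage with a fake fetcher that knows a single file.
const fakeFetch: TextFetcher = async (url) =>
  url === `${SITEMAP_BASE_URL}/sitemap.xml`
    ? { ok: true, text: "<sitemapindex/>" }
    : { ok: false, text: "" };

serveSitemapFromS3("sitemap.xml", fakeFetch).then((result) =>
  console.log(result.status, result.contentType),
);
```

Because the route only proxies a file that already exists, the per-request cost on the frontend is one fetch rather than a full sitemap build.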
Pros:
Offloads processing from the frontend.
No need to update the Next.js directory directly.
Cons:
The backend still handles the processing load.
Requires frontend logic to fetch and render the sitemap from S3.
Approach 4: Using AWS Lambda for Sitemap Generation
Process:
Similar to Approaches 2 and 3, but with AWS Lambda handling the sitemap generation.
Data fetching from Elasticsearch and sitemap creation are handled in the Lambda function.
Upload the sitemaps to an S3 bucket.
Next.js fetches the sitemaps from S3 and serves them.
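A rough sketch of how such a Lambda function might be organized is shown below. The Elasticsearch query and the S3 upload are passed in as dependencies (`fetchPaths` and `putObject`, both hypothetical), so the same code can run locally with in-memory fakes. How the function is triggered (for example, on a schedule) is left to the deployment and is not shown.

```typescript
interface SitemapDeps {
  // Hypothetical: returns page paths for a section, e.g. by querying Elasticsearch.
  fetchPaths: (section: string) => Promise<string[]>;
  // Hypothetical: uploads one object, e.g. through an S3 client.
  putObject: (key: string, body: string) => Promise<void>;
}

const SITE_ORIGIN = "https://example.com"; // assumed site origin

// Paths are assumed to be XML-safe here; escape them if they can contain & or <.
function toUrlset(paths: string[]): string {
  const urls = paths.map((p) => `  <url><loc>${SITE_ORIGIN}${p}</loc></url>`);
  return [
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}

// Builds the function that Lambda would invoke. It generates one sitemap per
// section and uploads each under "<section>-sitemap.xml".
function makeHandler(deps: SitemapDeps, sections: string[]) {
  return async function handler(): Promise<{ uploaded: string[] }> {
    const uploaded: string[] = [];
    for (const section of sections) {
      const key = `${section}-sitemap.xml`;
      await deps.putObject(key, toUrlset(await deps.fetchPaths(section)));
      uploaded.push(key);
    }
    return { uploaded };
  };
}

// Local usage with in-memory fakes instead of Elasticsearch and S3.
const bucket: { [key: string]: string } = {};
const handler = makeHandler(
  {
    fetchPaths: async (section) => [`/${section}/1`],
    putObject: async (key, body) => {
      bucket[key] = body;
    },
  },
  ["products", "designers"],
);
handler().then((result) => console.log(result.uploaded));
```

Injecting the dependencies keeps the handler testable without AWS credentials, at the cost of a small amount of wiring in the deployed entry point.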
Pros:
Offloads processing from both the backend server and frontend.
Scalable and serverless, reducing infrastructure overhead.
Cons:
Increased complexity due to Lambda function setup and maintenance.
Potential added costs associated with using AWS Lambda and S3.
NOTE: For all four approaches, you can list the main sitemap URL in the robots.txt file so that crawlers can find the sitemap easily.
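For reference, the robots.txt entry is a single `Sitemap:` line with an absolute URL. The helper below is a minimal sketch that produces such a file; the function name, the example domain, and the `/admin` path are assumptions.

```typescript
// Builds a robots.txt body. `disallowedPaths` is a hypothetical list of paths
// that crawlers should not fetch.
function buildRobotsTxt(sitemapUrl: string, disallowedPaths: string[] = []): string {
  const rules =
    disallowedPaths.length > 0
      ? disallowedPaths.map((p) => `Disallow: ${p}`)
      : ["Allow: /"];
  // The Sitemap line takes an absolute URL and is independent of the User-agent groups.
  return ["User-agent: *", ...rules, "", `Sitemap: ${sitemapUrl}`].join("\n") + "\n";
}

console.log(buildRobotsTxt("https://example.com/sitemap.xml", ["/admin"]));
// User-agent: *
// Disallow: /admin
//
// Sitemap: https://example.com/sitemap.xml
```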
Each approach has its advantages and challenges. The choice depends on factors such as system architecture, processing load management, and desired complexity level.