One thing that took me by surprise when I started researching SEO was that when a user enters a search term, the results are gathered from Google’s representation of the web, not the entire web. For a page to be included in its index, Google must have already parsed the page and stored its contents in its databases.
To do this, automated robots known as spiders or crawlers scan the internet for links leading to pages they can index. A crawler begins by scanning one page, then follows the links it finds to scan and index those pages in turn.
This pattern repeats until the search engine has indexed a sizable representation of the web. It stores the meta information and text it finds on each page in its own databases, and it is this data it uses to generate the search engine results pages (SERPs) displayed to users.
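The scan-and-follow pattern described above can be sketched as a breadth-first traversal. This is only an illustration, not how Google's crawlers are actually built: the `LINKS` graph and its paths are made up, and a small in-memory dictionary stands in for the web.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
LINKS = {
    "/": ["/events", "/about"],
    "/events": ["/events/1", "/events/2", "/"],
    "/about": [],
    "/events/1": ["/events/2"],
    "/events/2": [],
    "/orphan": ["/"],  # no page links here, so the crawl never reaches it
}


def crawl(start, links):
    """Breadth-first crawl: index a page, queue its unseen links, repeat."""
    index = []          # pages in the order they were indexed
    seen = {start}      # pages already queued, so each is visited only once
    queue = deque([start])
    while queue:
        page = queue.popleft()
        index.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


indexed = crawl("/", LINKS)
print(indexed)  # ['/', '/events', '/about', '/events/1', '/events/2']
```

Note that `/orphan` never appears in the result: it links out to the homepage, but nothing links to it, so a crawler starting from the homepage has no way to discover it.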
Having a website online does not guarantee that Google will find your site and include all of its pages in its rankings. Google needs to find each page through inbound and internal links, through the website’s own sitemap, or through manual submission. Eventbrite relies on a mixture of these strategies to make sure our pages are included in Google’s index of the web.
Inbound links are links from other domains that point to your website. Once Google’s crawlers land on a page, they quickly parse its content, including any links that do not specifically tell search engines to ignore them. If website A includes a link to website B, Google will follow the link to website B after it is done parsing website A. The more external sites that link to your site, the better the chance Google has of indexing your pages.
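As a rough sketch of what "links that do not tell search engines to ignore them" means in practice, the snippet below collects `href` values from anchor tags and skips any marked `rel="nofollow"`. It uses Python's standard `html.parser`; the class name, helper function and sample URLs are invented for illustration, and real crawlers apply many more rules than this.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Collect href values from <a> tags unless rel contains 'nofollow'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        # rel may hold several space-separated tokens, e.g. "nofollow sponsored"
        rel = (attr.get("rel") or "").lower().split()
        if href and "nofollow" not in rel:
            self.links.append(href)


def followable_links(html):
    parser = FollowableLinks()
    parser.feed(html)
    return parser.links


# Made-up markup: one normal external link, one nofollow link, one internal link.
SAMPLE = """
<a href="https://site-b.example/">Website B</a>
<a href="https://ads.example/" rel="nofollow sponsored">Ad</a>
<a href="/events">Events</a>
"""

print(followable_links(SAMPLE))  # ['https://site-b.example/', '/events']
```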
Inbound links also play a large part in increasing a site’s relevancy and authority. Google’s main aim is to treat each web page as a user would. It therefore deems pages that have a lot of natural inbound links to be popular and increases their ranking in relevant search results. These links must occur naturally, though, as Google is known to decrease a page’s rank, or remove the page from its index entirely, if the majority of its inbound links are from low-authority or irrelevant pages.
Links to our event pages are often included on our organizers’ own sites, which are indexed by Google. We also rely on press releases, news articles and blogs to link to these event pages when covering the event. The more links we are able to accrue from outside sources, the higher our authority score is. This boosts all Eventbrite pages, as Google deems the site trustworthy and popular based on the pages linking to it.
Once Google has landed on an Eventbrite page, we use internal linking to direct crawlers to other pages we want indexed. We use our most popular pages to point to other internal pages we want both users and Google to find. Our homepage is a popular entry point for users, so Google views any internal links found on that page as important for parsing and indexing. We take advantage of this by listing popular events and links to our category search pages.
We also take a lot of care curating the links within our footer, as they are shown on every page of our site, which is a good indicator to Google that these links are important. Some of the links in the footer are dynamic, depending on the top-level domain (TLD) visited. A user visiting eventbrite.com will see links to American cities in our footer, whereas a user visiting eventbrite.com.au will see Australian cities.
We also use breadcrumbs on our public event pages to link to city and category directory pages. Not only do they provide another place for Google to find these pages, but they also allow users to jump quickly to events similar to the one they are viewing.
A sitemap is a file, or set of files, that gives search engines a map for finding all the pages on a site. While it doesn’t replace linking, it does help crawlers find pages they might have missed due to orphaning and the absence of interlinking. Sitemaps also pass along useful metadata about each URL, including when it was last modified and how often the page may change. While you will mainly see sitemaps as XML files, text and RSS file types are also accepted by Google.
For large sites, it is best to break up sitemaps, as Google has a limit of 50,000 URLs per sitemap file and a file size limit of 50MB uncompressed. You can then place the URLs of your smaller sitemaps into a sitemap index file. This is the approach we take at Eventbrite, as we have over 10 million pages and growing.
Our main sitemap index holds links to the sitemaps for event pages, directory pages, venue profile pages and organizer pages, along with information on when each sitemap was last modified. Each sitemap then carries information on the priority of its pages. Together, this gives Google an indication of how often it should come back to index new pages.
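To make the split-and-index approach concrete, here is a minimal sketch that chunks a list of URLs into sitemap files of at most 50,000 entries and builds a sitemap index pointing at them. It is not Eventbrite's actual tooling: the function names, the `example.com` URLs, the file naming scheme and the `lastmod` date are all placeholders. The element names (`urlset`, `sitemapindex`, `loc`, `lastmod`) and the namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit


def chunk(urls, size=MAX_URLS):
    """Split a list of URLs into lists of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap(urls, lastmod):
    """Return the XML for one sitemap file listing the given page URLs."""
    root = ET.Element("urlset", xmlns=NS)
    for url in urls:
        entry = ET.SubElement(root, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")


def build_index(sitemap_urls, lastmod):
    """Return the XML for a sitemap index that points at each sitemap file."""
    root = ET.Element("sitemapindex", xmlns=NS)
    for url in sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")


# Placeholder data: 120,000 made-up event URLs and an arbitrary date.
urls = [f"https://www.example.com/e/{n}" for n in range(120_000)]
parts = chunk(urls)
sitemaps = [build_sitemap(part, "2024-01-01") for part in parts]
index = build_index(
    [f"https://www.example.com/sitemaps/events-{i}.xml" for i in range(len(parts))],
    "2024-01-01",
)
print(len(parts))  # 3
```

With 120,000 URLs the chunking yields three files (50,000, 50,000 and 20,000 entries), and the index contains one `<sitemap>` entry for each.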
Keep in mind that including a link in the sitemap will not guarantee that Google’s crawlers will index and parse that page. Sitemaps merely suggest links for search engines to index and should not replace linking practices.
For new sites, it is unrealistic to expect Google’s crawlers to find their pages through inbound links. Google allows you to manually submit either a single page or a sitemap through Google Search Console (formerly Google Webmaster Tools). Again, it is at Google’s discretion whether it will crawl and index these pages or not.
Google Crawl Budget
Google sets a crawl limit, also known as a budget, on each website. Every website has a different crawl budget, closely linked to its page rank. This means that the more relevant and important Google deems your site, the more time it will spend crawling and indexing your pages each time it visits.
The factors Google uses to set your crawl budget are your authority score, how often your site is updated, how frequently new pages are added, and individual page speed and size. To increase the number of pages Google indexes on each visit, reduce broken links: following them wastes the crawler’s time and leaves it with no further links to follow. You should also make sure there are no redirect loops. A redirect loop is where page A redirects to page B, which then redirects back to page A. The crawler will be stuck in a loop when it could have been indexing other pages on your site.
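One way to catch redirect loops before a crawler does is to walk your own redirect table and flag any chain that revisits a path. The sketch below assumes the redirects are available as a simple source-to-target mapping; the `REDIRECTS` table and the `find_loop` helper are hypothetical.

```python
# Hypothetical redirect table: source path -> target path.
REDIRECTS = {
    "/old-event": "/event",
    "/a": "/b",
    "/b": "/a",
}


def find_loop(start, redirects):
    """Follow redirects from `start`.

    Returns the chain of paths if a path is visited twice (a loop),
    or None if the chain ends at a page that does not redirect.
    """
    chain = [start]
    path = start
    while path in redirects:
        path = redirects[path]
        chain.append(path)
        if chain.count(path) > 1:
            return chain
    return None


print(find_loop("/old-event", REDIRECTS))  # None
print(find_loop("/a", REDIRECTS))          # ['/a', '/b', '/a']
```

Running a check like this over every redirect source gives a list of loops to fix, so the crawl budget is spent on real pages instead.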
Also make use of your robots.txt file: determine which pages are unimportant or of low quality, and add a rule disallowing crawlers from crawling these pages or directories. Eventbrite has over 10 million pages, but only 1.5 million are included in Google’s index. We pay close attention to pages that are low quality, spam, dated and so on, and restrict Google from indexing them. We also place the links we deem important as close to the homepage as possible, or make them easily accessible from our global navigation. A well-thought-out site hierarchy is key to making sure priority pages are indexed and reindexed frequently.
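Before shipping robots.txt rules, it can help to check which paths they actually block. The sketch below uses Python's standard `urllib.robotparser` against an inline rule set. The `Disallow` paths are invented for illustration and are not Eventbrite's real rules, and Python's parser may not match Googlebot's interpretation in every edge case.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: keep crawlers out of search results and draft pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="Googlebot"):
    """Return True if the rules above let `agent` crawl `path`."""
    return parser.can_fetch(agent, path)


print(allowed("/e/some-event"))       # True
print(allowed("/drafts/unfinished"))  # False
```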
With over 40 billion web pages on the internet, Google often needs a hand to find new websites and pages. By some estimates, Google has indexed only about 10% of the pages on the web. It is important to remember that when a user enters a search term in Google, the pages searched are not the entire web but Google’s representation of it. The results returned are those that Google has found and stored in its large databases.
You should not rely solely on one strategy to improve the chances of Google parsing and indexing all pages on your site. A clear and well-thought-out site hierarchy is important, with every page linked at least once internally. Sitemaps are a great starting point for Google to find your pages, and manual submission is important for new pages that are of high priority.
As your site grows and receives more inbound links, Google will prioritize indexing your new pages, as it wants the most relevant and popular pages appearing in search results. Including content that will draw users to your site will also increase your presence on search engines. Here at Eventbrite, we live by the motto that what is good for SEO should be good for user experience too.