There are two ways to make a domain crawl available to users: by deploying and supporting an access system internally (usually Wayback or a variant), or by using a hosted instance that the Internet Archive (IA) supports and maintains but that can be designed to match a partner’s website. The first method depends on the custodial institution's resources and capabilities. The second is something IA has done for domain-scale crawling partners; examples are provided below.
Some examples:
Site search: Includes both URL search and keyword search. Keywords are derived from the anchor text of all webpages linking to a host. Site search functionality is currently viewable in the new Wayback Machine at https://web.archive.org.
Media search: Media search takes an archived web media resource (such as an image) and “tokenizes” its URL by splitting the filename into individual words, which then become the text for a search index. An example of URL tokenization search can be seen in GifCities, where the search engine is powered by the words in the (in this case) .gif filenames. Tokenization makes it possible to search resources that may themselves contain no text.
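The URL tokenization described above can be illustrated with a minimal sketch. This is not IA's implementation: the function names (tokenize_url, build_index, search), the splitting rules, and the example.com URLs are all assumptions made for illustration, and a small in-memory dictionary stands in for a real search index.

```python
import re
from collections import defaultdict
from urllib.parse import unquote, urlsplit


def tokenize_url(url: str) -> list[str]:
    """Split the filename of a URL into lowercase word tokens (hypothetical rules)."""
    path = unquote(urlsplit(url).path)
    filename = path.rsplit("/", 1)[-1]
    # Drop the file extension, if there is one.
    stem = filename.rsplit(".", 1)[0] if "." in filename else filename
    # Insert a space at camelCase boundaries, then split on non-alphanumerics.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", stem)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def build_index(urls):
    """Map each token to the set of URLs whose filename contains it."""
    index = defaultdict(set)
    for url in urls:
        for token in tokenize_url(url):
            index[token].add(url)
    return index


def search(index, query: str) -> set[str]:
    """Return URLs whose filename tokens include every word in the query."""
    terms = [t.lower() for t in query.split()]
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


if __name__ == "__main__":
    urls = [
        "http://example.com/pets/dancing_cat-01.gif",
        "http://example.com/pets/sleepy-cat.gif",
        "http://example.com/space/SpinningGlobe.gif",
    ]
    index = build_index(urls)
    print(sorted(search(index, "cat")))
    print(sorted(search(index, "spinning globe")))
```

Because the tokens come only from the filename, an image with a descriptive name becomes findable even though the file itself contains no text. A resource with an uninformative name (for example, a numeric ID) would produce few useful tokens.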
All search indexing at the Internet Archive is done using Elasticsearch, a widely used open-source search tool. Elasticsearch is used across the Internet Archive for both web and non-web search, and runs on a monitored and maintained search cluster that provides high performance and makes it easy to add multiple indices.
Archive-It is our user-controlled web service for creating curated, publicly accessible web archives and born-digital collections.
Vault is our low-cost, easy-to-use digital repository and preservation service to store, manage, and preserve digital files and collections.
ARCH is our research and education service that helps users easily build, access, and analyze digital collections computationally at scale.