Features in SolrWayback

Victor Harbo Johnston edited this page Dec 16, 2024 · 4 revisions

This wiki page gives an overview of the features available in SolrWayback. Each feature is described in its own section below.

Text Search

SolrWayback offers many possibilities for discovery. One of these is free text search across all resources (HTML pages, PDFs, metadata for different media types, URLs, etc.).

List of search results with facets.

Interactive link graph (ingoing/outgoing) for domains.

The SolrWayback toolbox contains many tools. One of these is the interactive link graph, which visualises the ingoing and outgoing links of a domain.

Interactive domain link graph

Domain wordclouds

Another tool found in the toolbox is the wordcloud generator. This tool can generate wordclouds from text on single domains.

A wordcloud for the domain youtube.com

N-gram visualisations

The toolbox also contains a tool for visualising search results as an n-gram graph.

n-gram visualization of results by year, relative to the number of results that year.

Visualisation of search results by domain.

Another way to visualise search results is by domain over time. SolrWayback also has a feature to analyse and visualise statistics at the domain level. These statistics include the size of the domain and the number of ingoing and outgoing links.

Visualization of results by domain over time.

Image search

Ticking a single checkbox switches to an image search that returns only image results and presents them in a layout similar to Google Image Search.

Image search, show only images as results.

Image geo search

Another image search capability is the image geo search, which searches images based on their GPS location.

Search by GPS location among images that carry EXIF location information.

Search by upload

Another way to search in SolrWayback is by uploading a file (e.g., an image or a PDF). This lets you check whether the file has been harvested and find the HTML pages that use the uploaded file.

Export to WARC files

SolrWayback can export search results in multiple ways.

  • Search results can be exported to WARC files, which is done through a streaming download. This means that there is no limit to the size of the downloaded WARC file.
  • Text from search results can also be exported to CSV, where the fields to export are customisable.
  • Link graphs can be exported at large scale in Gephi format. (See https://labs.statsbiblioteket.dk/linkgraph/)

Alternative playback engine

In SolrWayback it is possible to configure an alternative playback engine. This lets you use SolrWayback for search and discovery while another engine, such as OpenWayback or pywb, handles playback.

Memento API

SolrWayback supports the Memento protocol. Mementos of a given URL can be found at timegates of the form /solrwayback/services/memento/{date}/url. The date can be left out to retrieve the newest memento in the archive. Dates can be specified as wayback dates in the format 20170101120000, or as shorter dates with only year and month, for example 201712.
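
As an illustration, the timegate pattern above can be assembled with a small helper. This is a minimal sketch and not part of SolrWayback: the base address http://localhost:8080 and the function name timegate_url are hypothetical, the target URL is assumed to be appended as-is, and the check that a date consists of 4 to 14 digits is an assumption based on the two examples given above.

```python
import re

# Hypothetical base address of a SolrWayback installation; adjust to your setup.
BASE = "http://localhost:8080"


def timegate_url(target_url, date=None, base=BASE):
    """Build a timegate URL; leave out `date` to ask for the newest memento."""
    if date is None:
        # Assumption: without a date the target URL follows /memento/ directly.
        return f"{base}/solrwayback/services/memento/{target_url}"
    # Assumption: a wayback date is 4 to 14 digits (yyyy up to yyyyMMddHHmmss).
    if not re.fullmatch(r"\d{4,14}", date):
        raise ValueError(f"not a wayback date: {date!r}")
    return f"{base}/solrwayback/services/memento/{date}/{target_url}"


print(timegate_url("http://example.com/", "20170101120000"))
print(timegate_url("http://example.com/", "201712"))
print(timegate_url("http://example.com/"))
```

Whether the target URL needs any escaping depends on the installation, so it is worth verifying against a running instance.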

The memento timemap API is also supported. The timemap API supports the following response formats:

  • link
  • json
  • spec

Link is the original format specified by Memento. With type=json, the result is returned in the JSON format used by PyWB instances and archive.org: a JSON array of arrays, which looks a lot like the response from a CDX index API call. With type=spec, a JSON response following the format specified by Memento is returned.
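
An array-of-arrays response like the json format can be turned into one dictionary per memento. The sketch below assumes that the first inner array is a header row of field names, as is common for CDX-style JSON output. That assumption, the helper name rows_to_dicts, and the sample payload with its field names are illustrative only and should be checked against an actual response.

```python
import json


def rows_to_dicts(payload):
    """Convert a JSON array of arrays into a list of dicts keyed by the header row."""
    rows = json.loads(payload)
    if not rows:
        return []
    # Assumption: the first row holds the field names, the rest hold the values.
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


# Made-up sample payload; the real field names may differ.
sample = """[
  ["timestamp", "original"],
  ["20170101120000", "http://example.com/"],
  ["20171201080000", "http://example.com/"]
]"""

for memento in rows_to_dicts(sample):
    print(memento["timestamp"], memento["original"])
```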

The timemap API can be queried at: /solrwayback/services/memento/timemap/{type}/url. The link and spec types support paging of the result. If the result is paged, the pages can be retrieved at the following URL: /solrwayback/services/memento/timemap/page/{type}/url
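
The two timemap URL patterns and the rule that only link and spec support paging can be captured in a short helper. This is a sketch under stated assumptions: the base address http://localhost:8080 and the function name timemap_url are hypothetical, and the target URL is assumed to be appended as-is. How an individual page is selected is not described here, so the helper only builds the paged endpoint itself.

```python
# Hypothetical base address of a SolrWayback installation; adjust to your setup.
BASE = "http://localhost:8080"

TIMEMAP_TYPES = {"link", "json", "spec"}  # response formats listed above
PAGED_TYPES = {"link", "spec"}            # formats that support paging


def timemap_url(target_url, type_="link", paged=False, base=BASE):
    """Build a timemap URL for the given response format."""
    if type_ not in TIMEMAP_TYPES:
        raise ValueError(f"unknown timemap type: {type_!r}")
    if paged and type_ not in PAGED_TYPES:
        raise ValueError(f"type {type_!r} does not support paging")
    prefix = "timemap/page" if paged else "timemap"
    return f"{base}/solrwayback/services/memento/{prefix}/{type_}/{target_url}"


print(timemap_url("http://example.com/", "json"))
print(timemap_url("http://example.com/", "spec", paged=True))
```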