Heritrix
Heritrix is an open-source web crawler, allowing users to target websites they wish to include in a collection and to harvest an instance of each site. The software is most often used as a powerful back-end tool incorporated into a web archiving workflow.
Provider
Internet Archive
Licensing and cost
Apache License, Version 2.0 – free. Some individual source code files are subject to or offered under other licenses.
Development activity
Version 3.1.1 was released in May 2012.
Heritrix powers the Internet Archive, and so receives ongoing support.
Platform and interoperability
As a Java application, Heritrix is theoretically platform agnostic; however, only Linux is supported. The software requires Java Runtime Environment 1.6 or higher, and at least 256 MB of available RAM.
Functional notes
Web crawls are carried out by configuring a ‘job,’ which itself is an instance of a crawl template called a ‘profile.’ Although they contain the same configurations, these two entities have different functions: a profile records the set of configurations and acts as a starting point for shaping a new job, but only the job itself can execute a crawl.
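The relationship between profiles and jobs can be illustrated with a minimal Python sketch. This is not Heritrix's actual API or configuration format; the class names `Profile` and `Job`, the `settings` dictionary, and the `seeds` and `max_hops` keys are all hypothetical and exist only to show a template being copied into a runnable instance.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical crawl template: holds settings but cannot crawl."""
    name: str
    settings: dict = field(default_factory=dict)

    def new_job(self, job_name, **overrides):
        # Copy the settings so that changes to the job never alter the profile.
        settings = copy.deepcopy(self.settings)
        settings.update(overrides)
        return Job(name=job_name, settings=settings)


@dataclass
class Job:
    """Hypothetical crawl job: the only object that can be run."""
    name: str
    settings: dict

    def run(self):
        # Placeholder for a real crawl: just report what would be fetched.
        return [f"would crawl {seed}" for seed in self.settings.get("seeds", [])]


profile = Profile("default", {"seeds": ["http://example.org/"], "max_hops": 3})
job = profile.new_job("example-job", max_hops=1)
print(job.run())
```

In this sketch the job starts from the profile's settings and can then be adjusted independently, which mirrors the description above: the profile is a reusable starting point and only the job is executed.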
The software will crawl FTP sites in addition to HTTP. Users can examine the results of a crawl by opening its log files, which include information about crawl problems and errors, each URI that was collected, and statistics about the job as a whole. Users can also create reports showing a summary of the crawl’s activity.
Heritrix stores the web resources it crawls in ARC files. The software includes a command-line tool called arcreader, which can be used to extract the contents.
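To give a rough idea of what such a file contains, the Python sketch below parses the header line of a single archived record. It assumes the version 1 ARC layout, in which each record begins with five space-separated fields (URL, IP address, archive date, content type, and content length). The function name `parse_arc_header` and the sample line are hypothetical, and real files are usually compressed, so arcreader remains the appropriate tool for actual extraction.

```python
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: str   # timestamp string, e.g. YYYYMMDDhhmmss
    content_type: str
    length: int         # number of content bytes following the header


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Split one record header line into its five assumed fields."""
    parts = line.strip().split(" ")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, found {len(parts)}")
    url, ip_address, archive_date, content_type, length = parts
    return ArcRecordHeader(url, ip_address, archive_date, content_type, int(length))


# Hypothetical header line, invented for illustration.
sample = "http://example.org/ 192.0.2.1 20120501120000 text/html 1024"
header = parse_arc_header(sample)
print(header.url, header.content_type, header.length)
```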
Documentation and user support
The User Guide for versions 3.0 and 3.1 is in the form of a wiki, which at the time of writing is not structured in any obvious narrative order; while detailed, it is very difficult to navigate. The User Manual for version 2.0 is structured and can be used as a reference for navigation. Extensive documentation is available, including release notes, Javadoc API documentation, and FAQs linking within the wiki.
Heritrix’s website links to two active mailing lists: a Yahoo discussion group and a SourceForge list distributing source code commits. The project also uses a public JIRA for bug, feature, and issue tracking.
Usability
Heritrix is installed via a command line interface, but once installed the user can launch a web-based interface for configuration. Setting up a crawl requires a significant number of adjustments.
Expertise required
Installation requires solid knowledge of Linux and command line interfaces. As with any web archiving software, deep understanding of the project’s scope and collections policy is essential in order to set up appropriate targets.
Standards compliance
Heritrix does not offer metadata support. The software is designed to respect robots.txt exclusion directives and META robots tags.
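As an illustration of what respecting robots.txt means in practice, the sketch below uses Python's standard `urllib.robotparser` module to decide whether a URL may be fetched. It does not reproduce Heritrix's own implementation; the rules in `ROBOTS_TXT`, the user agent name `example-crawler`, and the helper `is_allowed` are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that a site might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("http://example.org/public/page.html",
            "http://example.org/private/page.html"):
    print(url, is_allowed(ROBOTS_TXT, "example-crawler", url))
```

A crawler that honours these directives would skip the second URL, because its path falls under the disallowed `/private/` prefix.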
Influence and take-up
Heritrix is extremely influential; as of March 2012 the SourceForge site reports nearly 240,000 downloads. Users include the Internet Archive, the British Library, the United States Library of Congress, and the French National Library. The software powers Netarchive Suite and the Web Curator Tool.
