Automated Document Genre Classification Workshop: Supporting Digital Curation, Information Retrieval, and Knowledge Extraction

9 September 2009
Microsoft Research, Cambridge

Overview

In co-operation with the International Conference on the Theory of Information Retrieval (ICTIR) and Microsoft Research, Cambridge, UK, we held a workshop on Automated Document Genre Classification. The workshop was intended as a brainstorming session to build a research agenda for automated genre classification, identification, and recognition that will enhance and support workflows within:

  • Digital curation and preservation
  • Information management
  • Information seeking, search, and retrieval
  • Information extraction and knowledge discovery

The genre classification research community lacks consensus on methods of genre taxonomy generation, on evaluation, and on how to apply the research in existing systems. This event was intended to open up a discussion forum and identify:

  • How to constructively establish a useful genre taxonomy
  • How to integrate and apply genre classification within existing information systems
  • How to evaluate and consolidate its usefulness and effectiveness within these target systems

This workshop brought together key people from genre classification research and from the research areas listed above to establish a research road map for bringing genre classification to a level of maturity at which it can be applied.

Motivation

The automation of metadata extraction is crucial to digital curation activities, as the information deluge is likely to make manual extraction enormously costly. Organising documents into genre classes that indicate the physical and conceptual structure of the text could serve as a starting point for both automatic and manual extraction, by narrowing down the areas within the text from which to extract the required information.

Collection profiling is an important aspect of risk assessment and data audit within organisational collections. Each organisation focuses on document genres strongly associated with the activities and services central to the organisation: e.g. a research article as part of experimental research at a research centre; a report as part of news coverage at a newspaper corporation; a financial budget report as part of a business venture in a company. The identification of core document genres could form building blocks for defining criteria for identifying risks to the collection that are cognizant of the procedural context of the organisation.

Information retrieval techniques mostly rely on relevance measures calculated on the basis of the document's topical content. However, documents on the same topic may be created with different objectives and as part of different processes (e.g. research as opposed to product promotion), resulting in different levels of relevance, depth, usefulness, and reliability as a source of information. Genre classification (e.g. distinguishing an advertisement for a camera from a product review of the same camera) may be an effective method of supporting finer levels of granularity in relevance judgements.

Tentative Programme

The workshop consisted of four sessions. The first three sessions comprised three presentations each from selected speakers, followed by discussion. The fourth session took the format of open discussion.

09:00 – 09:30 Registration
09:30 – 11:00 Session I: Understanding genre classification — building a taxonomy
11:00 – 11:15 Coffee
11:15 – 12:45 Session II: Role of genre classification in existing information systems
12:45 – 14:00 Lunch
14:00 – 15:30 Session III: Viability of evaluating the effectiveness and usefulness of genre classification
15:30 – 15:45 Coffee
15:45 – 16:45 Session IV: Building a research road map — open discussion and summary of previous sessions
16:45 – 17:00 Close

Costs

This event cost £75.00.
