Automated Document Genre Classification Workshop: Supporting Digital Curation, Information Retrieval, and Knowledge Extraction
Overview
In co-operation with the International Conference on the Theory of Information Retrieval (ICTIR) and Microsoft Research, Cambridge, UK, we held a workshop on Automated Document Genre Classification. The workshop was intended as a brainstorming session for building a research agenda for automated genre classification, identification, and recognition that would enhance and support workflows within:
- Digital curation and preservation
- Information management
- Information seeking, search, and retrieval
- Information extraction and knowledge discovery
The genre classification research community lacks consensus on how to generate genre taxonomies, how to evaluate them, and how to apply genre classification in existing systems. This event was intended to open up a discussion forum and identify:
- How to constructively establish a useful genre taxonomy
- How to integrate and apply genre classification within existing information systems
- How to evaluate and consolidate its usefulness and effectiveness within these target systems
This workshop brought together key researchers in genre classification and in the research areas mentioned above to establish a research road map for bringing genre classification research to a level of maturity suitable for practical application.
Workshop Materials
- Final Workshop Programme [PDF, 61KB]
- Title Abstracts [PDF, 74KB]
- Presentations [ZIP, 4.51MB]
Motivation
Automating metadata extraction is crucial to digital curation, because the information deluge is likely to make manual extraction enormously costly. Organising documents into genre classes that indicate the physical and conceptual structure of the text could serve as a starting point for both automatic and manual extraction, by narrowing down the areas of the text from which the required information is to be extracted.
Collection profiling is an important aspect of risk assessment and data audit within organisational collections. Each organisation focuses on document genres strongly associated with its central activities and services: e.g. a research article as part of experimental research at a research centre; a report as part of news coverage at a newspaper corporation; a financial budget report as part of a business venture in a company. Identifying these core document genres could provide building blocks for defining risk criteria for the collection that take account of the organisation's procedural context.
Information retrieval techniques mostly rely on relevance measures calculated from a document's topical content. However, documents on the same topic may be created with different objectives and as part of different processes (e.g. research as opposed to product promotion), resulting in different levels of relevance, depth, usefulness, and reliability as sources of information. Genre classification (e.g. distinguishing an advertisement for a camera from a product review of the same camera) may be an effective way of supporting finer-grained relevance judgements.
Programme
The workshop consisted of four sessions. The first three sessions comprised three presentations each from selected speakers, followed by discussion. The fourth session took the format of open discussion.
| Time | Activity |
| --- | --- |
| 09:00 – 09:30 | Registration |
| 09:30 – 11:00 | Session I: Understanding genre classification — building a taxonomy |
| 11:00 – 11:15 | Coffee |
| 11:15 – 12:45 | Session II: Role of genre classification in existing information systems |
| 12:45 – 14:00 | Lunch |
| 14:00 – 15:30 | Session III: Viability of evaluating the effectiveness and usefulness of genre classification |
| 15:30 – 15:45 | Coffee |
| 15:45 – 16:45 | Session IV: Building a research road map — open discussion and summary of previous sessions |
| 16:45 – 17:00 | Close |
Costs
This event cost £75.00.
