Data, repositories and Google
7 March, 2008
In a post last year, Peter Murray Rust criticised DSpace as a place to keep data:
"The search engines locate content. Try searching for NSC383501 (the entry for a molecule from the NCI) and you’ll find: DSpace at Cambridge: NSC383501
"But the actual data itself (some of which is textual metadata) is not accessible to search engines so isn’t indexed. So if you know how to look for it through the ID, fine. If you don’t you won’t. [...]
"So (unless I’m wrong and please correct me), deposition in DSpace does NOT allow Google to index the text that it would expose on normal web pages. [...]
"If this is true, then repositing at the moment may archive the data but it hides it from public view except to diligent humans. So people are simply not seeing the benefit of repositing - they don’t discover material though simple searches."
Peter isn't often wrong, but in this case it was clear from comments to his post that Google does normally index DSpace content, not just the metadata. There were a couple of reasons for the effects Peter saw, but the key one related to the nature of the data. Jim Downing wrote, for example:
"Not sure what to tell you about your ChemML files. Possibly Google doesn’t know what to do with them and doesn’t try?
"That’s my understanding - interestingly, if you lie about the MIME type, Google does index CML (here, for example)."
The data Peter refers to is Chemical Markup Language data in a file with extension .cml. My Mac does not know what it is, and I guess no more does Google… unless perhaps you tell Google that it’s text, as Jim Downing seemed to be suggesting in his comment (I’m not sure this constitutes lying, more selective use of the truth). I can open CML files in my text editor, fine, although of course to process them into something chemically interesting, I would need some additional software or plugins… Here's a chunk of that file [sorry, tried to include some XML here but Blogger swallowed it up]...
There's real data here [trust me: InChI and SMILES at least, plus bond strengths etc] that could be indexed but isn't. The point is, surely, that this would be just as much a problem if the repository was simply a filestore full of CML files, which is how data is often made available. But unlike the filestore, there is usually some useful metadata in the repository which can assist data users (ie people, in this case); in a filestore, this is either absent, encoded in filenames, or in some conventional place such as README.TXT, where its relation to the actual data file is problematic.
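Jim Downing's remark about the MIME type, quoted above, comes down to which Content-Type a server declares for a .cml file. Here is a minimal sketch of that choice in Python, using the standard mimetypes module. The function names, the file name NSC383501.cml and the two candidate types (chemical/x-cml and text/xml) are my assumptions for illustration, not anything taken from DSpace or from Jim's comment, and nothing here guarantees that a search engine will actually index the result.

```python
import mimetypes


def build_type_map(serve_cml_as_text: bool) -> mimetypes.MimeTypes:
    """Return a MIME table with an explicit (hypothetical) entry for .cml."""
    # A fresh table that starts from Python's built-in defaults.
    types = mimetypes.MimeTypes()
    # Assumption: "chemical/x-cml" is the chemistry-specific label, while
    # "text/xml" is the "selective use of the truth" option, since CML is XML.
    cml_type = "text/xml" if serve_cml_as_text else "chemical/x-cml"
    types.add_type(cml_type, ".cml")
    return types


def content_type_for(filename: str, types: mimetypes.MimeTypes) -> str:
    """Pick the Content-Type a server might declare for this file."""
    guessed, _encoding = types.guess_type(filename)
    # Unknown extensions fall back to an opaque binary type, which gives a
    # crawler no hint that the content is readable text.
    return guessed or "application/octet-stream"


if __name__ == "__main__":
    # Hypothetical file name built from the NCI identifier mentioned above.
    for as_text in (False, True):
        table = build_type_map(as_text)
        print(as_text, content_type_for("NSC383501.cml", table))
```

The only difference between the two cases is the label attached to the same bytes, which is why I hesitate to call it lying.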
So: in the first place, Google et al are unlikely to index data, particularly unusual data types. And in the second place, repositories encourage metadata, which does get indexed. So from this point of view at least, a repository may provide better exposure for your data (and hence more data re-use) than simply making the files web-accessible.
This doesn't mean that current, library-oriented repositories are yet fit for purpose for science data! Far from it...
