Data science cat and dog

Andrew Russell Green

Research, data science and software portfolio

Semantic Web, NLP and Archives

An interdisciplinary project to model archival metadata using the Semantic Web, provide open access to cultural heritage, and incorporate natural language processing (NLP) in search.

Skills used
Product management
Co-design
Semantic Web
Data modeling
NLP
Usability testing
Interdisciplinary research
Archival science
Java
SPARQL
Writing

This was a years-long, multifaceted, interdisciplinary research and development project. For the project's duration, I was the de facto product manager. My role also included software development and research.

The project sought to develop a free software system for documenting and disseminating historical archives, especially photographs.

We used the Semantic Web to model archival metadata following the ISAD(G) standard, and created software for publishing image archives on the Web. For search, the system generated natural language descriptions of search results based on Semantic Web triples.

Other goals of the project included the development of methodologies for using audiovisual documents as primary sources in research, and the promotion of open access to cultural heritage. Most of my publications prior to 2024 are related to this project.

The project was part of the Audiovisual Laboratory for Social Research at the Instituto Mora, a public research institute in Mexico City, Mexico.

This online image archive runs on the software we developed, and its content was also produced by the project.

Product Management and Co-Design

I performed numerous tasks that are typically done by product managers, including studying user requirements, engaging with stakeholders, prioritizing features, planning rollouts, and writing non-technical explanations of technical topics.

The main fields in this interdisciplinary research were Social Science, Archival Science and Computer Science; I provided the bridge between Computer Science and the other areas. Stakeholders included researchers, students, archivists and archives. In many ways, our approach was a form of co-design.

Alongside product management, research and software development, I conducted usability tests.

Semantic Web, Data Modeling and NLP

The Semantic Web provided the base format for modeling metadata in this project. We designed metadata for archival images, focusing on their use as primary sources in social science research.
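As a rough illustration of what archival metadata looks like when expressed as triples, here is a minimal Python sketch. Every identifier and predicate in it (photo:001, ex:depictsPlace, and so on) is hypothetical and is not the project's actual vocabulary. Java and SPARQL are the technologies listed for the project; Python is used here only for brevity.

```python
# Minimal sketch of metadata as subject-predicate-object triples.
# All identifiers and predicates below are hypothetical examples,
# not the vocabulary used in the project.

TRIPLES = [
    ("photo:001", "rdf:type", "ex:Photograph"),
    ("photo:001", "ex:depictsPlace", "place:san_angel"),
    ("place:san_angel", "rdfs:label", "San Ángel"),
    ("map:007", "rdf:type", "ex:Map"),
    ("map:007", "ex:creator", "person:angel"),
    ("person:angel", "rdfs:label", "Ángel"),
]


def objects_of(triples, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Follow a two-step path: photograph -> place -> label.
place = objects_of(TRIPLES, "photo:001", "ex:depictsPlace")[0]
place_label = objects_of(TRIPLES, place, "rdfs:label")[0]
```

Because each statement is a separate triple, a description of a photograph can link to places, people and other documents without a fixed table schema.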

The process of formulating this archival metadata using Semantic Web triples led to new understandings about the images and the social contexts in which they were produced.

The screenshot below also shows how the Semantic Web supported natural language generation. In a search for the word “Angel”, the system identified photographs of a place called “San Ángel”, a map produced by someone named “Ángel”, and objects with “Angel” in various other metadata fields. In the left panel are natural language descriptions of groups in the search results, generated from paths in the graph of Semantic Web triples.
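The idea of generating descriptions from graph paths can be sketched as follows. This is a simplified illustration under stated assumptions, not the system's implementation: the data, the predicates, the TEMPLATES table and the function names are all hypothetical, and only one-step paths from an item to a labeled resource are considered.

```python
import unicodedata
from collections import defaultdict

# Hypothetical sample data; not the project's actual vocabulary.
TRIPLES = [
    ("photo:001", "rdf:type", "ex:Photograph"),
    ("photo:001", "ex:depictsPlace", "place:san_angel"),
    ("photo:002", "rdf:type", "ex:Photograph"),
    ("photo:002", "ex:depictsPlace", "place:san_angel"),
    ("place:san_angel", "rdfs:label", "San Ángel"),
    ("map:007", "rdf:type", "ex:Map"),
    ("map:007", "ex:creator", "person:angel"),
    ("person:angel", "rdfs:label", "Ángel"),
]

# Hypothetical phrase templates, keyed by (item type, linking predicate).
TEMPLATES = {
    ("ex:Photograph", "ex:depictsPlace"): "photographs of a place called {label}",
    ("ex:Map", "ex:creator"): "maps produced by someone named {label}",
}


def fold(text):
    """Lowercase and strip diacritics so that 'Angel' matches 'Ángel'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def describe_matches(triples, query):
    """Group matching items by the path linking them to a matching label,
    and return one natural language description per group."""
    types = {s: o for s, p, o in triples if p == "rdf:type"}
    labels = {s: o for s, p, o in triples if p == "rdfs:label"}
    needle = fold(query)

    groups = defaultdict(list)
    for s, p, o in triples:
        label = labels.get(o)
        if label is not None and s in types and needle in fold(label):
            groups[(types[s], p, label)].append(s)

    descriptions = {}
    for (item_type, predicate, label), items in groups.items():
        template = TEMPLATES.get((item_type, predicate))
        if template is not None:
            descriptions[template.format(label=label)] = sorted(items)
    return descriptions


results = describe_matches(TRIPLES, "Angel")
```

Here each group corresponds to one kind of path through the graph (item type, predicate, matched label), so the same search term yields separate descriptions for places, creators and so on.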

Open Access, Knowledge and Cultural Heritage

We promoted the idea that archival images should be widely disseminated under open licenses, to help unlock their potential as sources of knowledge about society.

This ran counter to prevailing practices in social science at the time. Many institutions limited access to primary sources, and often only written documents were regarded as legitimate sources.

The project also promoted a knowledge-centric view of cultural heritage.

Publications

Over the years, this project produced numerous publications, including an award-winning book about research methodology and papers on natural language processing and cultural heritage.