Digital Curation Technologies #DKT15
Symposium at Humboldt-Universität zu Berlin on 6 October 2015
This report summarizes the contributions and major outcomes of the Semantic Media Web event that took place on 26 and 27 September 2013 in Berlin. The report focuses on the second day of the event and the topic of metadata for digital publishing. The event was organized by Xinnovations and the W3C Germany and Austria Office. NOTE: the event was mostly held in German. Hence, the linked presentations for the first day are in German, while some of the linked presentations for the second day are in English. German slides are highlighted in blue, English slides in green.
The topic of the first day was multimedia archives and semantic technologies. The introductory presentations from Felix Sasaki (slides) and Adrian Paschke (slides) emphasized that the basic building blocks of semantic technologies, that is, the core set of Semantic Web technologies, are stable. The focus now is on applications that actually use these technologies, e.g. in the area of multimedia content. Adrian Paschke introduced the term pragmatic Web as a label for the next, usage-centred aim of research.
In the first session, Felix Daub showed an example of how semantic technologies can foster the creation of new business models for multimedia content. Rolf Fricke (slides) introduced a technical building block for re-purposing multimedia content: fragment identifiers can be used to address fragments of multimedia objects. The MediaMixer project explores applications of media fragments.
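To illustrate the fragment identifiers mentioned above, the following minimal sketch builds URIs in the style of the W3C Media Fragments URI 1.0 syntax, using the temporal (`t=`) and spatial (`xywh=`) dimensions. The helper name `media_fragment` and the example URL are assumptions made for illustration; they are not taken from the presentation or from the MediaMixer project.

```python
from urllib.parse import urldefrag


def media_fragment(url, start=None, end=None, xywh=None):
    """Return `url` with a Media Fragments style fragment appended.

    start/end are offsets in seconds (temporal dimension, `t=`);
    xywh is an (x, y, width, height) tuple in pixels (spatial dimension).
    """
    base, _ = urldefrag(url)  # drop any fragment already present
    parts = []
    if start is not None or end is not None:
        begin = "" if start is None else f"{start:g}"
        finish = "" if end is None else f",{end:g}"
        parts.append(f"t={begin}{finish}")
    if xywh is not None:
        parts.append("xywh=" + ",".join(str(int(v)) for v in xywh))
    return f"{base}#{'&'.join(parts)}" if parts else base


# Hypothetical video URL, used only for illustration.
clip = media_fragment("http://example.org/video.mp4", start=10, end=20)
```

Here `clip` addresses seconds 10 to 20 of the video without creating a new file, which is what makes such identifiers useful for re-purposing existing content.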
The next session focused on the movie industry and introduced the upcoming project D-Werft. Funded by the Federal Ministry of Education and Research, D-Werft responds to use cases in the movie industry. In his presentation, Uli Kunz (slides) explained that a great variety of metadata is created during movie production but rarely put to use. Harald Sack explained that D-Werft will provide technologies for Linked Production Data to make use of this metadata. Application scenarios include semantic and explorative search and intelligent recommendation systems.
Semantic Story Telling was the topic of the next session. Jürgen Keiper described the use case: stories including multimedia content should not be hardwired but created dynamically, taking linked data resources, user preferences and other information into account. Thomas Hoppe (slides) introduced technical infrastructure that could be used to realize this vision. Armin Berger (slides) emphasized the user perspective: to help create semantic stories, new interfaces and new concepts for interacting with content and linked resources are needed.
In the session multimedia mashups, Georg Rehm (slides) introduced META-SHARE, a network of repositories of language data, tools and related web services. The motivation behind META-SHARE serves as an example relevant for all semantic technology applications: how can resources be made available in a sustainable and legally transparent manner? Heiko Ehrig (slides) gave the final presentation of the first day on a hot topic: he introduced use cases for text mining and information extraction in the context of the German federal election 2013.
The second day focused on the topic of digital publishing and metadata. Bernard Gidon (slides) introduced the W3C Digital Publishing Activity. The activity started this year; in the Digital Publishing Interest Group, topics like the relation between EPUB3 and HTML5 or the internationalization of eBooks are discussed by a wide range of participants. The topic of metadata and digital publishing is not yet in scope for W3C. Below are key messages from the presentations that can give valuable input, from a German regional perspective, on how or whether to address this topic.
The first session discussed metadata from the perspective of various application scenarios. Michael Steidl (slides) talked from the point of view of news agencies and news industry vendors. He brought up fundamental questions: what is a metadata vocabulary? A list of hand-crafted terms, of concepts, or of auto-generated items can each be regarded as metadata, but the usage scenarios and the approaches to creating a monetary benefit from the metadata differ considerably. A question that came up again later in the workshop was how to create relations between metadata vocabularies: should one use a pivot vocabulary or create decentralized, direct mappings? Also, metadata vocabularies always have a cultural aspect: a general, culture- and language-agnostic vocabulary cannot be created.
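The pivot versus direct mapping question can be made concrete with a small sketch. It assumes two hypothetical vocabularies ("news" and "library") with invented terms, each mapped to an equally invented pivot vocabulary; none of these names come from the presentations. The sketch also counts how many mappings have to be maintained in each approach, assuming one mapping per vocabulary pair for direct mappings and a pivot that is separate from the vocabularies being mapped.

```python
from itertools import combinations

# Hypothetical term mappings from two vocabularies into a shared pivot.
TO_PIVOT = {
    "news": {"headline": "title", "byline": "creator"},
    "library": {"Titel": "title", "Verfasser": "creator"},
}


def translate_via_pivot(term, source, target):
    """Map `term` from `source` to `target` by going through the pivot."""
    pivot_term = TO_PIVOT[source].get(term)
    if pivot_term is None:
        return None
    # Invert the target mapping to get from pivot terms to target terms.
    from_pivot = {pivot: local for local, pivot in TO_PIVOT[target].items()}
    return from_pivot.get(pivot_term)


def mapping_counts(n):
    """Mappings to maintain for n vocabularies: (direct, via pivot)."""
    direct = len(list(combinations(range(n), 2)))  # one per vocabulary pair
    return direct, n  # one mapping from each vocabulary into the pivot
```

For five vocabularies, `mapping_counts(5)` yields 10 direct mappings against 5 pivot mappings. The pivot approach needs fewer mappings, but every translation is limited to what the pivot vocabulary can express, which is where the cultural aspect mentioned above comes into play.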
The value of auto-generating metadata was demonstrated by Stefan Geißler. The main application scenario was automatic semantic enrichment of content. Academic publishers apply these methods to existing content to improve content workflows and to produce semantic products that provide new ways of accessing content. With the vast amounts of material to process in the academic publishing realm, manual annotation does not scale. Which automatic methods to deploy, however, also depends on the target industry. Hence, a set of components for metadata generation and automatic content processing seems useful.
Thomas Hoppe (slides) argued that, besides metadata schemes, the vocabulary determines what can be expressed in annotations. In several important business-to-consumer application domains, e.g. HR, products or news, appropriate vocabularies are still missing to bridge differences between the language used by authors and by users. He argued that neither existing normative controlled classification systems nor community-developed folksonomies are sufficient. The former are too general and too restrictive; the latter lead to mappings that are too incomplete and unreliable for commercial exploitation.
The presentation from Gerard Kuys (slides) served as a bridge to the next session. He focused on using metadata in a cultural heritage scenario. A huge number of metadata vocabularies have been created in the cultural heritage realm. This has resulted in hand-crafted, high-quality topic indexing, but issues of metadata interoperability and linking between vocabularies are yet to be resolved. A promising application scenario for metadata in the cultural heritage realm and beyond is the semantic storytelling approach discussed during the first day, illustrated with descriptions of Dutch history generated from DBpedia for educational purposes.
The cultural heritage session started with a presentation from Jana Kittelmann (slides). She explained the portal http://www.pueckler-digital.de/, an example of how to create additional value for archival artefacts available on the Web. The hypertextual, navigational representation on the Web itself is key for this scenario. Metadata comes into play in two areas: exporting the portal-specific metadata into a central catalogue system so that researchers in general can use it, and representing metadata as RDF for future interlinkage with other linked data resources. In summary, the portal showed a real example of the direction the aforementioned semantic storytelling approach could take.
The German National Library (DNB) is a key provider of metadata in the German cultural heritage community. Alexander Haffner (slides) from DNB presented the project IN2N. It aims at providing access to the metadata treasure even outside the cultural heritage community. To this end, the core data set GND will be made available across domains. The presentation raised awareness of an important finding of the Semantic Media Web event: the value of metadata depends on how others can make use of it. But to make this happen in the library domain, help from actors in the publishing industry is needed.
Complementing the German library perspective, Antoine Isaac (slides) remotely presented the European digital library Europeana on behalf of himself and Stefan Gradmann. Metadata in Europeana is formulated in terms of the Europeana Data Model (EDM), a Semantic Web-based metadata vocabulary for describing cultural artefacts. A key design premise of EDM is that it re-uses existing vocabularies from inside and outside the cultural heritage realm. The extensibility of EDM is demonstrated e.g. in the DM2E project, which works on data aggregation and on developing rich services for the digital humanities. From a Europeana perspective, W3C is setting the scene for metadata vocabularies, e.g. with vocabularies in the realm of eGovernment like DCAT or the Organization Ontology. But there is also a need for domain expertise and room for standardization organizations in communities like the cultural heritage domain.
Ina Blümel (slides) emphasized the value of metadata about named entities like persons, locations, events or organizations for accessing the growing amount of digital information in the cultural heritage realm. Automatic content enrichment methods are indispensable for dealing with the mass of information. Enrichment with references to linked data sources can help to contextualize cultural artefacts. The presentation provided examples of such artefacts from the Competence Centre for Non-Textual Materials, building a bridge to the first day of the event.
Ernesto De Luca introduced a project and tooling in the realm of restoration. Here, many entities are identified: the objects being restored, the materials used, the methodologies applied etc. The presentation introduced work on an ontology for restoration with the aim of keeping this valuable information together for re-use by others in the restoration community and beyond. The representation as RDF helps to interlink existing restoration data and in this way creates additional value beyond the restoration domain.
Steffen Meier (slides; see also his Manifesto of Ignorance, in German and on the same page in English) presented the point of view of AKEP (work group for electronic publishing). Metadata issues are both essential and a pain for the publishing industry. Three relevant aspects can be identified. First, metadata is important as a means of marketing, i.e. for discoverability, so that books are actually bought. Second, the interest is increasingly not in a publication bought by large audiences, but in smaller content items created for a long tail of many customers. To put it differently, there are micro-interests and micro-audiences. Knowledge about the customers is key for success in this new scenario. Third, to be able to process content in response to the other two aspects, means of automatic content indexing are essential. Publishing houses need to make huge investments to be able to take up the new opportunities. But for the large group of SME publishers, the step from printed products to data structures is hard to take. Meier summarized the situation as "publishers are storytellers, not computer scientists".
Bettina de Keijzer (slides) described the role of metadata for the large academic publishing house De Gruyter. Metadata is crucial as an instrument for customers to find and navigate content and is therefore used for integration with libraries and for discoverability, e.g. in semantic search scenarios. Such a scenario can be realized if the publishing house is in control of the complete metadata value chain, including manual and automatic metadata production, semantic enrichment, and distribution. The goal is to achieve interoperability between publishing houses and their diverse customer groups. When catering to a global market, there are not just specific metadata requirements for various use cases but also regional differences. Aligning metadata formats amongst publishers, distributors and customers would be a means of moving the role of metadata on the Web forward.
Ronald Schild (slides) shed light on the different types of metadata that are relevant for publishers. There is no question that descriptive metadata is relevant for booksellers. But the potential buyer of a book will not search for the descriptive metadata items. The metadata needed for actually selling books is created in the marketing departments that organize web shops. The result is highly dynamic metadata, that is, less hardwired categorization of content. A potential buyer will formulate queries that can be answered by relying on automatic indexing and information extraction. In a sales situation, the amount of dynamically generated metadata might exceed the book content itself. As a result, data collection and data integration are core tasks, including statistics about the search behaviour of customers, market data etc.
Clemens Weins discussed metadata to support multilingual content creation. This is a complex process, including many participants and software systems. The Internationalization Tag Set (ITS) 2.0 specification provides metadata items (so-called data categories) to help in this process. ITS 2.0 is closely aligned with Web technologies like HTML5, see e.g. the HTML5 translate attribute. The usefulness of ITS 2.0 has been demonstrated with a real translation showcase involving the VDMA publishing house.
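As an illustration of the HTML5 translate attribute mentioned above, the following minimal sketch separates the text of an HTML snippet into parts to be translated and parts to be left untouched. It uses only Python's standard `html.parser` module and assumes well-formed markup with explicit closing tags; the class and function names are hypothetical, and the sketch is not the tooling used in the showcase.

```python
from html.parser import HTMLParser


class TranslateSplitter(HTMLParser):
    """Split text nodes by the inherited value of the translate attribute."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.state_stack = [True]  # content is translatable by default
        self.to_translate, self.protected = [], []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return  # void elements have no content and no end tag
        attr = dict(attrs)
        state = self.state_stack[-1]  # inherit from the parent element
        if "translate" in attr:
            value = (attr["translate"] or "").lower()
            if value in ("", "yes"):
                state = True
            elif value == "no":
                state = False
        self.state_stack.append(state)

    def handle_endtag(self, tag):
        if tag not in self.VOID and len(self.state_stack) > 1:
            self.state_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            target = self.to_translate if self.state_stack[-1] else self.protected
            target.append(text)


def split_translatable(html):
    """Return (texts to translate, texts to leave untouched)."""
    parser = TranslateSplitter()
    parser.feed(html)
    parser.close()
    return parser.to_translate, parser.protected
```

For the snippet `<p>Press <code translate="no">Ctrl+S</code> to save the file.</p>`, the key name inside the `code` element ends up in the protected list, so that a translation tool can pass it through unchanged.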
In several contributions, the final session took up the issue of mapping between vocabularies (pivot vocabularies or direct mapping) that had been discussed in the first session.
Robert Tolksdorf (slides) discussed the development of metadata vocabularies in general. For historical reasons, there are competing metadata schemes for similar topic domains even within one application area like Web search. The presentation provided examples of how to make use of this variety. In the process of metadata vocabulary development, pivot vocabularies emerge naturally over time, while domain-specific vocabularies play a role in the related communities.
Sebastian Hellmann introduced the linked data community effort DBpedia. The DBpedia ontology is maintained by a community of ontology authors. There is no pivot vocabulary: mappings between vocabularies are created directly. Currently, the community is exploring how to maintain the ontology and at the same time provide tooling and a methodology for ontology customization and for fusion with data from specialized domains and for selected use cases. For these goals, the GitHub platform and its branching model may provide adequate infrastructure.
Daniel Kinzler introduced Wikidata. Like the DBpedia ontology, the Wikidata vocabulary is created via a community effort. It provides a large set of structured concepts that also serve as a bridge between languages. Wikidata can be regarded as complementary to DBpedia. In the past, concepts in DBpedia were created by scraping infoboxes. Wikidata has the potential to replace this error-prone effort by providing hand-crafted concepts, evaluated by the community. A mapping between Wikidata and DBpedia is already under discussion.
Mathias Schindler, project manager at Wikimedia Deutschland, introduced the usage of controlled vocabularies in Wikipedia. His presentation fit nicely with other contributions: GND, discussed by Alexander Haffner, is often used in the German edition of Wikipedia as a link target, originating in Wikipedia pages specific to persons or other types of entities. The effort is also complementary to the aforementioned approaches of DBpedia and Wikidata: they focus on metadata outside the actual textual content, whereas the GND links are introduced by authors in a dedicated section at the end of a given Wikipedia page. A challenge now is how to bring these various metadata resources together and how to address issues like metadata maintenance and sustainability.
The presentation of Janine Lantzsch (slides) summarized the outcome of a student project undertaken within the Information Sciences department at the University of Applied Sciences Potsdam. The students were asked to imagine a multimedia content provider who wants to enrich content with metadata based on Schema.org. The challenge was that Schema.org is not tailored towards the multimedia domain. The students were asked to examine existing multimedia metadata vocabularies as potential means to extend Schema.org. Issues came up such as how to evaluate the quality of metadata vocabularies, or which parts of a given vocabulary one should actually choose for integration into Schema.org. The conclusion from the project was that, for the given use case, applying Schema.org as a pivot vocabulary may not be the right approach. The high value Schema.org provides via its general metadata definitions was not in question. Nevertheless, for the given task, the approach of mapping domain-specific vocabularies directly or via a rather specialized pivot vocabulary like the Ontology for Media Resources seems to be more promising.
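To give an idea of the general metadata definitions Schema.org offers, the following minimal sketch builds a JSON-LD description of a video using the Schema.org type VideoObject and a few of its general properties. The function name and all example values are assumptions made for illustration and do not stem from the student project; which additional, domain-specific properties a multimedia provider would need is precisely the question the students examined.

```python
import json


def video_jsonld(name, description, content_url, duration_iso8601):
    """Build a Schema.org VideoObject description as a JSON-LD string."""
    doc = {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "contentUrl": content_url,
        "duration": duration_iso8601,  # ISO 8601 duration, e.g. "PT1M30S"
    }
    return json.dumps(doc, indent=2, ensure_ascii=False)


# Hypothetical values, used only for illustration.
example = video_jsonld(
    name="Archive footage",
    description="A short clip from a film archive.",
    content_url="http://example.org/clip.mp4",
    duration_iso8601="PT1M30S",
)
```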
There was no time for a discussion among the participants to draw conclusions. Reviewing the contributions, it seems that a static pivot metadata vocabulary for digital publishing would not provide immediate benefit in the highly dynamic metadata value chain. But besides the recurring usage scenario of semantic story telling, which also served as a bridge between the two days of the event, certain dichotomies in handling metadata could be identified, e.g.: automatic generation versus handcrafted vocabularies; stable descriptive metadata to handle workflows in the publishing domain versus dynamically generated metadata for marketing content; and domain-specific metadata in the hands of communities like cultural heritage or Wikipedia versus rather centralized metadata for general consumption on the Web.
Future work on metadata and digital publishing may help to detect issues with these dichotomies. Aligning metadata formats across the complete workflow amongst publishers, distributors and customers, as well as guidelines for mapping metadata that are specific to usage scenarios, may be of value for publishing on the Web at large.