Data Model

Overview

Combine’s Data Model can be broken down into the following hierarchy:

Home --> Organizations --> Record Groups --> Jobs --> Records
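The containment hierarchy can be pictured as nested objects. The following is only an illustrative sketch and not Combine's actual model code: the class names mirror the terms used in this document, while the field names (`name`, `record_id`, and so on) and the `count_records` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    record_id: str  # hypothetical identifier field


@dataclass
class Job:
    name: str
    records: List[Record] = field(default_factory=list)


@dataclass
class RecordGroup:
    name: str
    jobs: List[Job] = field(default_factory=list)


@dataclass
class Organization:
    name: str
    record_groups: List[RecordGroup] = field(default_factory=list)


def count_records(org: Organization) -> int:
    """Walk Organization -> Record Groups -> Jobs -> Records and count the Records."""
    return sum(len(job.records) for rg in org.record_groups for job in rg.jobs)


# Example: one Organization with one Record Group holding a single Job.
harvest = Job("Harvest", [Record("rec1"), Record("rec2")])
fedora = RecordGroup("Fedora Repository", [harvest])
foo_university = Organization("Foo University", [fedora])
total = count_records(foo_university)
```

Each level only knows about the level directly beneath it, which matches the "contains" relationships described in the sections that follow.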

Organizations

“Organizations” is the highest layer in the data hierarchy of Combine (equivalent to a ‘Data Provider’ in REPOX). For the typical service hub, an ‘organization’ will correspond to a cultural institution. Combine was designed to be flexible within a complicated ecosystem of metadata providers and harvesters.

A single instance of Combine might have two Organizations:

Organizations contain Record Groups. The Organization “Foo University” would contain all Record Groups shared by Foo University; for example, a Fedora repository, an Omeka instance, and an aggregation of records hosted on behalf of a local historical society. Regardless of origin, if the Record Groups are all shared by a single institution, they can be logically grouped under the same Organization.

Record Groups

“Record Groups” are similar to “Data Sets” in REPOX.

A Record Group is a set of records grouped intellectually. Usually this means they are part of the same collection, but they could also be a set of records from different collections that are grouped together in order to share the same transformation or other process.

Taking our Foo University example above, it would be reasonable to make the Fedora repository, the Omeka exhibits, and the historical society records each a separate and distinct Record Group, three in all.

Record Groups contain Jobs.

Jobs

Jobs will be new for anyone accustomed to REPOX. A Record Group can, and usually does, contain several Jobs. A Job is an action taken on a Record Group and the results of that action. A Job can also be thought of as one step in a series of actions.

In a typical Record Group, you may see one Job that represents a harvest of records, another for a transformation of those records, and another for a different transformation. One of these Jobs will probably be “Published” (meaning it has been added to Combine’s OAI feed; see “Part 13: Publishing Records” for more).

To give an example, let’s return to the Record Groups from Foo University and say that their Fedora repository includes a collection of photographs under the name “County Fair.”

Note how an early Job can become an input for later Jobs.

All the metadata for the “County Fair” collection can be found in each Job, but each set is at a different stage: harvest, transformation, and alternative transformation. Each Job is additive and does not delete or replace the data from an earlier Job unless the user takes special steps to do that. Combine errs on the side of duplicating data in order to preserve a record of provenance or lineage, and to maintain a standard of transparency regarding how and why a Record “downstream” looks the way it does.
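The additive behavior described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Combine's implementation: the `Job` class, its `input_job` field, and the `run_transformation` and `lineage` helpers are hypothetical names, and the "transformation" is a plain string replacement standing in for a real metadata transformation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Job:
    name: str
    records: Dict[str, str]            # record id -> XML string
    input_job: Optional["Job"] = None  # upstream Job, if any (hypothetical field)


def run_transformation(name: str, input_job: Job,
                       transform: Callable[[str], str]) -> Job:
    """Create a new Job from an input Job; the input Job is left untouched."""
    new_records = {rid: transform(xml) for rid, xml in input_job.records.items()}
    return Job(name=name, records=new_records, input_job=input_job)


def lineage(job: Job) -> List[str]:
    """Return Job names from the earliest upstream Job down to this one."""
    chain: List[str] = []
    current: Optional[Job] = job
    while current is not None:
        chain.append(current.name)
        current = current.input_job
    return list(reversed(chain))


# A harvest Job feeds a transformation Job; both copies of the record survive.
harvest = Job("Harvest", {"rec1": "<title>county fair</title>"})
transformed = run_transformation(
    "Transformation",
    harvest,
    lambda xml: xml.replace("county fair", "County Fair"),
)
history = lineage(transformed)
```

Because every Job keeps its own copy of the records and a pointer to its input, it is always possible to look upstream and see what a Record looked like before a given step.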

Creating new Jobs while retaining older Jobs gives Combine’s Record Groups a kind of version control. Jobs from earlier harvests can be retained in order to preserve a history of that Record Group. If a problem emerges with a harvest, having older Jobs still available makes it possible to step back to an earlier version of that Record Group until the problem is identified and fixed.
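As a rough illustration of this version-control idea, the sketch below keeps every Job in a Record Group and tracks which one is currently published. It assumes that "stepping back" means publishing an earlier, retained Job again; the `RecordGroupHistory` class and its methods are hypothetical and exist only for this example.

```python
from typing import List, Optional


class RecordGroupHistory:
    """Hypothetical helper: retains every Job and tracks the published one."""

    def __init__(self) -> None:
        self.jobs: List[str] = []             # Job names, oldest first
        self.published: Optional[str] = None  # name of the published Job

    def add_job(self, name: str, publish: bool = False) -> None:
        self.jobs.append(name)                # older Jobs are never removed
        if publish:
            self.published = name

    def step_back_to(self, name: str) -> None:
        """Publish an earlier, retained Job in place of the current one."""
        if name not in self.jobs:
            raise ValueError(f"unknown Job: {name}")
        self.published = name


history = RecordGroupHistory()
history.add_job("Harvest 1")
history.add_job("Transformation 1", publish=True)
history.add_job("Harvest 2")
history.add_job("Transformation 2", publish=True)

# A problem is found in the second harvest, so fall back to the earlier version.
history.step_back_to("Transformation 1")
```

Stepping back changes only which Job is published. The newer Jobs stay in the history, so they can be inspected while the problem is investigated.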

The history of a Record Group might eventually look like this:

In this scenario, the Record Group has eight Jobs, but only Job08, the fourth “transformation” Job, appears in Combine’s OAI feed.

There are three types of Jobs: Harvest, Transformation, and Merge/Duplicate.
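The three Job types can be written down as a simple enumeration. This is a sketch for orientation only: the `JobType` enum and the `needs_input_job` helper are hypothetical, and the rule that only a Harvest starts without an input Job is an assumption drawn from the descriptions in this section.

```python
from enum import Enum


class JobType(Enum):
    HARVEST = "Harvest"
    TRANSFORMATION = "Transformation"
    MERGE_DUPLICATE = "Merge/Duplicate"


def needs_input_job(job_type: JobType) -> bool:
    """Assumption: a Harvest brings records in from outside,
    while the other types start from one or more existing Jobs."""
    return job_type is not JobType.HARVEST


labels = [job_type.value for job_type in JobType]
```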

Regardless of type, all Jobs contain Records.

Records

The most granular level of hierarchy in Combine is a single Record. Records are contained within Jobs.

Each Record’s XML content and its other attributes are stored in MongoDB, while its indexed fields are stored in Elasticsearch.
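To make the split between the two stores concrete, the sketch below turns one Record into two plain dictionaries: one holding the full XML, and one holding flat fields suitable for indexing. It makes no calls to MongoDB or Elasticsearch, and the field names (`record_id`, `document`) and the one-field-per-child-element mapping are assumptions for illustration, not Combine's actual schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

record_xml = (
    "<record><title>County Fair</title>"
    "<creator>Foo University</creator></record>"
)


def to_storage_documents(record_id: str, xml: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split one Record into a full document and a flat set of indexed fields."""
    root = ET.fromstring(xml)
    # Document-store side: the complete XML plus other attributes.
    full_document = {"record_id": record_id, "document": xml}
    # Search-index side: one flat field per child element (hypothetical mapping).
    indexed_fields = {child.tag: (child.text or "") for child in root}
    indexed_fields["record_id"] = record_id
    return full_document, indexed_fields


full_document, indexed_fields = to_storage_documents("rec1", record_xml)
```

The full document preserves the Record exactly as it was written by the Job, while the indexed fields are a derived view that can be searched and analyzed without parsing XML each time.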

Identifiers

A Record has three important identifiers:


