Command Line
Note: We weren't able to give these commands a complete test during the final phase of development, so it's possible some of them may now be deprecated.
Though Combine is designed primarily as a GUI application, the command line provides a powerful and rich interface to the models and methods that make up the Combine data model. This documentation is meant to expose some of those patterns and conventions.
For all contexts, the OS combine user is assumed, along with the Combine Miniconda python environment, which can be activated from any location by typing:
# become combine user, if not already
su combine
# activate combine python environment
source activate combine
There are a few command line contexts:
- Django shell
- A shell that loads all Django models, with some additional methods for interacting with Jobs, Records, etc.
- Django management commands
- Combine specific actions that can be executed from bash shell, via Django’s manage.py
- Pyspark shell
- A pyspark shell that is useful for interacting with Jobs and Records via a spark context.
These are described in more detail below.
Django Python Shell
Starting
From the location /opt/combine run the following:
./runconsole.py
Useful and Example Commands
Convenient methods for retrieving instances of Organizations, Record Groups, Jobs, Records
Most of these expect a DB identifier for instance retrieval:
# retrieve Organization #14
org = get_o(14)
# retrieve Record Group #18
rg = get_rg(18)
# retrieve Job #308
j = get_j(308)
# retrieve Record by id '5ba45e3f01762c474340e4de'
r = get_r('5ba45e3f01762c474340e4de')
# confirm these retrievals
In [2]: org
Out[2]: <Organization: Organization: SuperOrg>
In [5]: rg
Out[5]: <RecordGroup: Record Group: TurboRG>
In [8]: j
Out[8]: <Job: TransformJob @ May. 30, 2018, 4:10:21 PM, Job #308, from Record Group: TurboRG>
In [10]: r
Out[10]: <Record: Record: 5ba45e3f01762c474340e4de, record_id: 0142feb40e122a7764e84630c0150f67, Job: MergeJob @ Sep. 21, 2018, 2:57:59 AM>
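These convenience methods return ordinary Django model instances, so their attributes can be inspected directly. As a minimal sketch using the Record retrieved above (the document attribute holds the Record's raw XML, as used in the examples that follow, and record_id matches the identifier shown in the repr):
# inspect the Record's raw XML document (truncated here for readability)
print(r.document[:200])
# the Record's identifier, as shown in the repr above
r.record_id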
Loop through Records in Job and edit Document
This example shows how it would be possible to:
- retrieve a Job
- loop through Records of this Job
- alter Record, and save
This is not a terribly efficient way to do this, but it demonstrates how the data model is accessible from Combine's command line. A more efficient approach would be to write a custom, Python snippet Transformation Scenario.
# retrieve Job model instance
In [3]: job = get_j(563)
# loop through records via get_records() method, updating record.document (replacing 'foo' with 'bar') and saving
In [5]: for record in job.get_records():
   ...:     record.document = record.document.replace('foo', 'bar')
   ...:     record.save()
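As a read-only variation on the example above, the same get_records() method can be used to inspect Records before committing any changes. The following is a minimal sketch that uses only the methods already shown (Job #563 is simply the example id from above):
# retrieve Job model instance
job = get_j(563)
# count Records whose document contains 'foo', without saving any changes
count = 0
for record in job.get_records():
    if 'foo' in record.document:
        count += 1
print(count)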
Combine Django Management Commands
Combine Update
It’s possible to perform an update of Combine either by pulling changes to the current version (works best with dev and master branches), or by passing a specific release to update to (e.g. v0.3.3).
To update the current branch/release:
./manage.py update
To update to another branch / release tag, e.g. v0.3.3:
./manage.py update --release v0.3.3
The update management command also contains some “update code snippets” that are included with various releases to perform updates on models and pre-existing data where possible. An example is the update from v0.3.x to v0.4 that modified the job_details for all Transformation Jobs. Included in the update is a code snippet called v0_4__update_transform_job_details() that assists with this. While running the update script as outlined above, this code snippet will fire and update Transformation Jobs that do not meet the new data model.
These possible updates can be invoked without pulling changes or restarting any services by including the following flag:
./manage.py update --run_update_snippets_only
Or, if even more granular control is needed and the names of the snippets are known (e.g. v0_4__update_transform_job_details), they can be run independently of others:
./manage.py update --run_update_snippet v0_4__update_transform_job_details
Full State Export
One pre-configured manage.py command is exportstate, which will trigger a full Combine state export (you can read more about those here). Though this could be done via the Django python shell, it was deemed helpful to expose an OS-level bash command so that it could be fired via cron jobs or other scripting. It makes for a convenient way to back up the majority of important data in a Combine instance.
Without any arguments, this will export all Organizations, Record Groups, Jobs, Records, and Configuration Scenarios (think OAI Endpoints, Transformations, Validations, etc.); effectively anything stored in databases. This does not include configurations in localsettings.py, or other system configurations, but is instead meant to export the current state of the application.
./manage.py exportstate
Users may also provide a string of JSON to skip specific model instances. This is somewhat experimental, and currently only works for Organizations, but it can be helpful if a particular Organization need not be exported. The skip_json argument expects Organization ids as integers; the following is an example of skipping the Organization with id == 4:
./manage.py exportstate --skip_json '{"orgs":[4]}'
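If scripting from Python rather than bash, Django's standard call_command interface should also be able to invoke the same management command. The following is a minimal, untested sketch; it assumes it is run from the Combine Django shell (where Django settings are already loaded) and that the keyword argument skip_json mirrors the --skip_json flag above:
# invoke the exportstate management command from Python via Django's call_command
from django.core.management import call_command
# equivalent to: ./manage.py exportstate
call_command('exportstate')
# equivalent to: ./manage.py exportstate --skip_json '{"orgs":[4]}'
# (assumes the option's dest is skip_json)
call_command('exportstate', skip_json='{"orgs":[4]}')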
Pyspark Shell
The pyspark shell is an instance of Pyspark, with some configurations that allow for loading models from Combine.
Note: The pyspark shell requires the Hadoop Datanode and Namenode to be active. These are likely running by default, but in the event they are not, they can be started with the following (Note: the trailing : is required, as it indicates a group of processes in Supervisor): sudo supervisorctl restart hdfs:
Note: The pyspark shell, when invoked as described below, will be launched in the same Spark cluster that Combine’s Livy instance uses. Depending on available resources, it’s likely that users will need to stop any active Livy sessions, as outlined here, to give this pyspark shell the resources to run.
Starting
From the location /opt/combine run the following:
./pyspark_shell.sh
Useful and Example Commands
Open Records from a Job as a Pyspark DataFrame
# import some convenience variables, classes, and functions from core.spark.console
from core.spark.console import *
# retrieve Records from MySQL as a pyspark DataFrame
# (this example retrieves Records from Job #308; note that the spark instance,
# provided by the pyspark context, must be passed as the first argument to the convenience method)
job_df = get_job_as_df(spark, 308)
# confirm retrieval okay
job_df.count()
...
...
Out[5]: 250
# look at DataFrame columns
job_df.columns
Out[6]:
['id',
'combine_id',
'record_id',
'document',
'error',
'unique',
'unique_published',
'job_id',
'published',
'oai_set',
'success',
'valid',
'fingerprint']
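Because job_df is an ordinary pyspark DataFrame, standard DataFrame operations can be applied to it. As a small sketch using the columns listed above (this assumes the valid column is stored as a 0/1 flag; adjust as needed):
# filter to Records marked invalid, and inspect their ids and error fields
invalid_df = job_df.filter(job_df.valid == 0)
invalid_df.select('record_id', 'error').show(10, truncate=False)
# count invalid Records
invalid_df.count()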
Tests
Though Combine is by and large a Django application, it has characteristics that do not lend themselves to using the built-in Django unit tests: namely, DB tables that are not managed by Django and, as such, would not be created in the test DB scaffolding that Django tests usually rely on.
Instead, Combine uses out-of-the-box pytest for unit tests.
Demo data
In the directory /tests, some demo data is provided for simulating harvest, transform, merge, and publishing records.
- mods_250.xml - 250 MODS records, as returned from an OAI-PMH response
- during testing this file is parsed, and 250 discrete XML files are written to a temp location to be used for a test static XML harvest
- mods_transform.xsl - XSL transformation that performs transformations on the records from mods_250.xml
- during transformation, this XSL file is added as a temporary transformation scenario, then removed post-testing
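To illustrate what happens with mods_250.xml, the following is a rough sketch (not Combine's actual test code) of splitting an OAI-PMH response into discrete per-record XML files; the output directory and the OAI record element structure are assumptions:
# rough illustration only: split an OAI-PMH response into individual record files
import os
from lxml import etree
tree = etree.parse('tests/mods_250.xml')
ns = {'oai': 'http://www.openarchives.org/OAI/2.0/'}
out_dir = '/tmp/mods_split'  # hypothetical temp location
os.makedirs(out_dir, exist_ok=True)
for i, record in enumerate(tree.findall('.//oai:record', namespaces=ns)):
    with open(os.path.join(out_dir, 'record_%s.xml' % i), 'wb') as f:
        f.write(etree.tostring(record))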
Running tests
Note: Because Combine currently only allows one Job to run at a time, and these tests are essentially a series of small Jobs, it is important that no other Jobs are running in Combine while running tests.
Tests should be run from the root directory of Combine; for Ansible or Vagrant builds, this is likely /opt/combine. Running tests also requires sourcing the anaconda Combine environment with source activate combine.
Testing creates a test Organization, Record Group, and Jobs. By default, these are removed after testing, but they can be kept for viewing or analysis by including the flag --keep_records.
Examples
# run tests, no output, create Livy session, destroy records
pytest
# run tests, see output, use active Livy session, keep records after test
pytest -s --keep_records
# run tests, ignore file test_basic.py, and ignore warnings
pytest -s --ignore=tests/test_basic.py -p no:warnings