Supplejack : Installation Walk-through by Example

Dependencies

Be sure to read through the dependencies guide here before continuing the walkthrough.

Goal

By the end of this tutorial, you will have a working set up of Supplejack on your local machine. Note that these instructions are intended to be run on either Mac or Linux.

Step One - Setting up your working directories:

Create a folder called supplejack on your machine. Inside of this folder, clone down the following projects.

https://github.com/DigitalNZ/supplejack_manager as ‘manager’. https://github.com/DigitalNZ/supplejack_worker as ‘worker’.

Step Two - Setting up the Manager:

The Manager and Worker need a key to communicate with each other, the instructions below will tell you how to create them.

cd into your manager directory and run bundle install. This will install all of the dependencies of the application.

Once this has finished, copy the config/application.yml.example to config/application.yml. This file contains the environment variables that your manager needs to run. Looking at this file will show you the ports that the other applications in the app need to run on, and it is where you add the keys.

Now, boot up the rails console rails c and run User.create(email: 'your@email.com', name: 'Your Name', password: 'yourpassword', password_confirmation: 'yourpassword', role: 'admin'). These details are also what you will use to login to the Manager web interface.

Make sure you make yourself an admin otherwise you will not be able to create parser scripts.

This will create your user that you will use to login to the manager.

Now you will need to generate the worker key.

Run bundle exec rake secret

Write this key down, it is the WORKER_KEY

Step Three - Setting up the Worker

As with the manager, cd into your worker directory and run bundle install.

Once this has finished, copy the config/application.yml.example to config/application.yml.

For the Manager:

Add your WORKER_KEY to the manager/config/application.yml where it states WORKER_KEY. You will need to add this for each environment listed in this file.

For the Worker:

Add your WORKER_KEY to the worker/config/application.yml in the specified place. You will need to do this for the test, development, and staging environments.

You can now boot up the Manager and the Worker

** Make sure Mongo (port 27017) and Redis server (port 6379) are running **

Inside of the Manager directory, run bundle exec rails s -p 3001.

Inside of the Worker directory, run bundle exec rails s -p 3002. You will also need to run sidekiq. Sidekiq is responsible for processing the workers jobs. Inside of the worker directory, run bundle exec sidekiq.

The worker, manager and api ports are configured in the config/application.yml file of each project.

To determine that everything is working correctly, visit the following:

The Manager: http://localhost:3001 You will be able to login here with the credentials you previously generated for the manager.

The Workers Sidekiq: http://localhost:3002/sidekiq Here you will be able to see a visual representation of the queues that are being processed.

Step Five - Setting up the API

The Supplejack API project on github is mountable rails engine. This means it is intended on being run inside of a host Rails application. Our first step will be creating the host application.

First you will need to install Rails 5.

gem install rails --version=5.1.4.

Because we are using an older version of rails, the most simple way to get started is like this:

create a new folder called api, inside of the folder create a Gemfile.
In the Gemfile, add the following:
source 'https://rubygems.org'
gem 'rails', '5.1.4'
run bundle install.

Once it has finished, run bundle exec rails new . --force --skip-bundle.

Now, open up the app in your text editor and add gem 'supplejack_api', git: 'https://github.com/DigitalNZ/supplejack_api.git' to the Gemfile.

Once you have done this, run bundle install. If there are conflicts, run bundle update and they should resolve themselves.

Next, we can run the supplejack installer with bundle exec rails g supplejack_api:install. Press Y for any questions that it asks you. This will generate the extra files that your app needs to be a working Supplejack API.

You also need to remove this line:

protect_from_forgery with: :exception

from your app/controllers/application_controller.rb

Lastly, we need to configure the harvester key that allows the Worker and the Manager to communicate with the API. To do this, open up the rails console, with rails console, and create a user.

SupplejackApi::User.create(email: 'yourharvest@email.com', password: 'password', password_confirmation: 'password', role: 'harvester')

Note, make sure that you have the harvester role. Otherwise you will get an authentication failed error when you attempt to create content partners and run harvests.

Once the user has been made, copy it’s authentication_token and put it in the config/application.yml of each the Worker and the Manger where it says HARVESTER_API_KEY. Again, you will need to do this for each environment.

We also need the WORKER_HOST and the WORKER_KEY environment variables in the API. So add these environment variables and the correct values to them in config/application.yml.

You are now ready to run the stack!

Step Six - Running the Stack

Boot the API:

Cd into your api directory and boot it up on port 3000

bundle exec rails s -p 3000

Boot Solr

Cd into your API directory and run bundle exec rails generate sunspot_rails:install. You will be asked if you wish to overwrite sunpot.yml? Answer Yes.

Now run:

bundle exec rake sunspot:solr:run

Boot the Manager

cd into your manager directory and run:

bundle exec rails s -p 3001

Note if you are still running the manager from previous steps, you will need to restart it for the config changes to take effect. Kill the Rails app with ctrl-c.

Boot the Worker

Cd into your worker directory and run:

bundle exec rails s -p 3002

Note if you are still running the worker from previous steps, you will need to restart it for the config changes to take effect. Kill the Rails app with ctrl-c.

Boot Sidekiq

Cd into your worker directory: bundle exec sidekiq

You are now ready to set up your parser! Here we will make an example parser but once you have got it working.

The documentation for the parser DSL can be found here:

http://digitalnz.github.io/supplejack/

Step Six

Log into your manager and click ‘New Data Source’, give it a name and a contributor.

Once it completes, hover over ‘Contributors and Scripts’ in the top navigation, click ‘Parser Scripts’, and then click ‘New Parser Script’.

Name your parser, Otago Hocken, select the contributor and data source that you created before, and then choose OAI format.

The click ‘Create Parser Script’.

You will now have a screen which will allow you to write your parser script for the harvester, this parser is written in Ruby and has it’s own DSL, which is documented at the link that I provided earlier.

Make your parser look like this:

class OtagoHocken < SupplejackCommon::Oai::Base
  base_url "http://otago.ourheritage.ac.nz/oai-pmh-repository/request"
  validates :usage, inclusion: {in: ["Share", "Modify", "Use commercially", "All rights reserved", "Unknown"]}

  validates :landing_url, format: {with: /\Ahttps?:/}
  validates :thumbnail_url, format: {with: /\Ahttps?:/}
  validates :large_thumbnail_url, format: {with: /\Ahttps?:/}
  validates :landing_url, size: { is: 1 }
  validates :internal_identifier, size: { is: 1 }

  namespaces dc: 'http://purl.org/dc/elements/1.1/',
  			 oai_dc: 'http://www.openarchives.org/OAI/2.0/oai_dc/',
  					xsi: 'http://www.w3.org/2001/XMLSchema-instance',
  			dcterms: 'http://purl.org/dc/terms/',
  						o: 'http://www.openarchives.org/OAI/2.0/'

  attributes :content_partner, :display_content_partner,     default: "University of Otago"
  attributes :display_collection, :primary_collection,       default: "Otago University Research Heritage"
  attribute  :collection,															       default: ["Otago University Research Heritage"]
  attributes :copyright, :usage,                             default: "All rights reserved"
  attributes :dc_rights, :rights_url,                        default: "http://digital.otago.ac.nz/terms.php"
  #attribute :dc_type, default: "Watercolors"

  attribute :title,         xpath: "//dc:title"
  attribute :description,   xpath: "//dc:description"
  attribute :date,          xpath: "//dc:date",        date: true
  attribute :display_date,  xpath: "//dc:date"
  attribute :contributor,   xpath: "//dc:contributor"
  attribute :publisher,     xpath: "//dc:publisher"
  attribute :subject,	      xpath: "//dc:subject"
  attribute :source,	      xpath: "//dc:source"
  attribute :creator,	      xpath: "//dc:creator"
  attribute :dc_type,	      xpath: "//dc:type"
  attribute :format,	      xpath: "//dc:format"

  attribute :category do
    category = "Images"
    category = "Videos" if get(:dc_type).find_with(/^Video$/).present?
    category
  end

  attributes :landing_url do
    fetch("//dc:identifier").find_with(/^http:\/\/otago.ourheritage.ac.nz\/items\/show/)
  end

  attribute :internal_identifier do
    get(:landing_url).downcase
  end

  attributes :large_thumbnail_url do
    fetch("//dc:identifier").find_with(/\.jpg$/).mapping(/original/ => 'fullsize').first
  end

  attributes :thumbnail_url do
    get(:large_thumbnail_url).mapping(/fullsize/ => 'square_thumbnails')
  end

  attribute :dc_identifier do
    dcidentifier = get(:dc_identifier)
    dcidentifier += fetch("//header/identifier")
    dcidentifier += fetch("//metadata//identifier").find_without(/^http/)
    dcidentifier
  end

  reject_if do
    not get(:landing_url).find_with(/^http/i).present?
  end
end

You should now be able to preview your parser! Click the preview button and the various components of Supplejack will whurr into a working state.

If you get an error such as ‘Unable to connect to localhost:3001 over (1)’ you will need to change the config/application.yml of the MANAGER and the WORKER to have the API HOST, MANAGER HOST, and WORKER HOST, as http://127.0.0.1:port instead of http://localhost:port.

If you get an error such as Errno::ECONNREFUSED: Failed to open TCP connection to 127.0.0.1:3001 (Connection refused - connect(2) for "127.0.0.1" port 3001) when previewing / harvesting, add the -b 0.0.0.0 option to the rails s command when running the stack locally.

If you do not get any records coming through in your preview, look in the Sidekiq pane of your terminal. If there is an error about not being able to find a Record with a specific ID, you will have a mismatch within the Record and PreviewRecord IDs. The best thing to do here is to go into the API project, run the rails console and delete the SupplejackApi::Record and SupplejackApi::PreviewRecord models.

You can do this with SupplejackApi::Record.destroy_all and SupplejackApi::PreviewRecord.destroy_all

If your preview is successful, you will see a modal window appear a json object of the potential records. Something like this:

{
  "priority": 0,
  "match_concepts": null,
  "content_partner": [
    "University of Otago"
  ],
  "display_content_partner": [
    "University of Otago"
  ],
  "display_collection": [
    "Otago University Research Heritage"
  ],
  "primary_collection": [
    "Otago University Research Heritage"
  ],
  "collection": [
    "Otago University Research Heritage"
  ],
  "copyright": [
    "All rights reserved"
  ],
  "usage": [
    "All rights reserved"
  ],
  "dc_rights": [
    "http://digital.otago.ac.nz/terms.php"
  ],
  "rights_url": [
    "http://digital.otago.ac.nz/terms.php"
  ],
  "title": [
    "Gog and Magog, Stewart Island. 1880."
  ],
  "description": [
    "Lower left (l.l.) in ink: W. Deverell Jany 1880; margin below image in pencil: Gog &amp; Magog Stewart Island."
  ],
  "date": [

  ],
  "display_date": [

  ],
  "contributor": [

  ],
  "publisher": [

  ],
  "subject": [
    "Islands",
    "Landscape"
  ],
  "source": [
    "Found uncatalogued in Hocken 1947. Dr T.M. Hocken’s Collection"
  ],
  "creator": [
    "unknown"
  ],
  "dc_type": [
    "Image",
    "Still Image",
    "Watercolors",
    "Art"
  ],
  "format": [

  ],
  "category": [
    "Images"
  ],
  "landing_url": [
    "http://otago.ourheritage.ac.nz/items/show/4480"
  ],
  "internal_identifier": [
    "http://otago.ourheritage.ac.nz/items/show/4480"
  ],
  "large_thumbnail_url": [
    "http://s3.amazonaws.com/ourheritagemedia%2Ffullsize%2Fa24297b36e4d17d4b8bd4da8f7d6bb6c.jpg"
  ],
  "thumbnail_url": [
    "http://s3.amazonaws.com/ourheritagemedia%2Fsquare_thumbnails%2Fa24297b36e4d17d4b8bd4da8f7d6bb6c.jpg"
  ],
  "dc_identifier": [
    "oai:otago.ourheritage.ac.nz:4480"
  ],
  "source_id": "otago-hocken",
  "data_type": "record"
}

If you have a successful preview, close the modal window, scroll to the bottom of the page, enter some text in the message field, and click Update Parser Script. Next, in the righthand columns, click on the name of your parser script under History, then click the blue button Tag as Staging.

You can now run a staging harvest, note that ‘Staging’ is a term bound to the Manager and does not relate to the Rails environment.

Click ‘Run Harvest’, and then click ‘Staging Harvest’, enter 1000, and then click start. I do not know how large this example harvest is.

Step Seven - Indexing your Records

Once the harvest has finished you are now ready to index your records into Solr. On our DNZ servers we have a background job that handles this, but since we are running locally we can just do it in the Rails console of the API.

Cd into your api repository, enter the rails console and run the following.

Sunspot.session = Sunspot::Rails.build_session
Sunspot.index(SupplejackApi::Record.all)
Sunspot.commit

If you want to delete all of your records, you can run this. Do not run on production

Sunspot.session = Sunspot::Rails.build_session
SupplejackApi::Record.destroy_all
Sunspot.remove_all
Sunspot.commit
SupplejackApi::PreviewRecord.destroy_all

You can now see your records on the API! Go to http://localhost:3000/records.json?fields=verbose,title,thumbnail_url,large_thumbnail_url to see the serialized JSON response from Solr.

Note the fields that are returned by default are determined by the groups on your RecordSchema. You can alter the your RecordSchema to whichever fields you would like, check out the docs here.

To index your records in the background, run the rake task that is provided by the engine as a seperate process.

bundle exec rake index_processor:run[1000]

This will look in Mongo for batches of records of your provided size to pull out and index. It will also remove records that have been deleted. We like to run this as a seperate long running ruby process.

Step Eight - Using the supplejack client to pull data from your API

Create a new Rails application, this can be Rails 5 if you would like.

rails new my_app_name

Add the supplejack_client to your Gemfile.

gem ‘supplejack_client’, git: ‘https://github.com/DigitalNZ/supplejack_client.git’ Then run bundle install

The default testing framework that Rails ships with is MiniTest, all of our projects use RSpec to add it to the Gemfile and follow the configuration steps to get up and running.

Then you can run rails g supplejack:install. This will scaffold the basic things your app needs to communicate with your Supplejack API.

Now open up config/initializers/supplejack_client.rb and add your config.api_key and your config.api_url. The api key is generated the same was as the one you generated for the harvester. Refer back to Step 5 except this time you want to be either a developer or an admin user. SupplejackApi::User.create(email: 'youruser@email.com', password: 'password', password_confirmation: 'password', role: 'admin'). The harvester role is only for the worker and the manager to communicate with the API.

The config.api_url will need to be uncommented.

You also need to update the config.fields array to include the symbols :verbose, :title, :thumbnail_url, :large_thumbnail_url, :record_id.

This will ensure that we get some useful data through the API.

Now create a record model and include the Supplejack::Record class into it. Don’t use the rails g model record command to do this, as it will create a rails migration, which is unnecessary: this model won’t inherit from ApplicationRecord like most models do. Instead, Supplejack will govern the model behaviour and link it to our api.

Like this:

class Record
  include Supplejack::Record
end

Now you can search the API in the rails console with Supplejack::Search.new(params) or find a specific record like Record.find(record_id).

You can find a record_id by adding record_id to the list of fields in the api GET request you previously made in the browser.

Note: You may also want to update the REQUEST_LIMIT_MAILER setting in my_app_name/config/application.yml to an email address that you manage to receive admin/service alerts.)

Step Nine - Showing your data on your host app

Create a new records_controller in your host application. You can do this with rails g controller records. The generate command will create the controller, the view, and the related spec files.

Note: if you need to undo your rails g command, you can run the same command again but using rails d instead of g. Make your records controller look like this:

class RecordsController < ApplicationController
  def index
    @search = Supplejack::Search.new(params)
    @records = @search.results
  end

  def show
    @record = Record.find(params[:id])
  end
end

Update your config/routes.rb file to look like this:

resources :records

Now you can add two templates, one for the index page and one for the show page. Make your app/views/records/index.html.erb file look like this

<h1>Otago Hocken</h1>
<% @records.each do |record| %>
  <%= link_to(record.title, record_path(record.record_id)) %>
  <%= image_tag(record.thumbnail_url) %>
<% end %>

And your app/views/records/show.html.erb file look like this:

<h1><%= @record.title %></h1>
<%= image_tag(@record.large_thumbnail_url) %>
<%= link_to('Back to records', records_path) %>

Now boot up your application, bundle exec rails s -p 3005 and visit the /records page.

You will be able to see the harvested content! There will be no styling but feel free to template up the site and make it look how you want. You can search against the API by going to /records?text=your_text_string

The API also supports pagination, so you can pass &per_page=number and &page=number to do this. You can find out the total number of items on the @search object itself, it has a method ‘total’.

You now have every element of the Supplejack stack working together correctly, awesome job!

Installation Walk-through by Example