Parser DSL (Domain Specific Language)
Records source
base_url
The base_url method allows the operator to specify where to fetch the harvest resources from. It accepts a URL or an absolute path on disk. Additionally, the operator can specify different URLs/paths for each environment.
proxy
The proxy method allows the operator to specify a proxy that requests to the base_url should be routed through.
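A minimal sketch, assuming the proxy is given as a URL string (the address below is hypothetical):
proxy "http://proxy.example.org:3128"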
Web resource
base_url "http://gdata.youtube.com/feeds/api/videos"
Path
base_url "file:///data/sites/harvester/resources/nz_museum.xml"
Different paths per environment
base_url staging: "file:///data/sites/harvester/staging/nz_museum.xml"
base_url production: "file:///data/sites/harvester/production/nz_museum.xml"
All the files in a directory
Dir.glob('/export/home/harvest/nlnzcat/updates/*').sort.each do |filename|
base_url "file://#{filename}"
end
Authentication
basic_auth
Allows the operator to add HTTP Basic Authentication to all requests executed by the parser it is applied to.
The first string is the username and the second is the password.
basic_auth "username", "password"
This will append “username:password” to the URLs of the requests executed by this parser, so a request to “http://gdata.youtube.com/feeds/api/videos” will be converted to “http://username:password@gdata.youtube.com/feeds/api/videos”.
http_headers
Allows the operator to add HTTP headers to the request so that requests can be made to protected endpoints.
It can be used like this:
http_headers({'x-api-key': 'api-key', 'Authorization': 'Token token="token"'})
Pagination
paginate
Allows the operator to paginate through an API and specify the names of the parameters used by the particular API, as well as the values for those parameters.
Numbered pagination
base_url "http://gdata.youtube.com/feeds/api/videos"
paginate page_parameter: "start-index", type: "item", per_page_parameter: "max-results", per_page: 50, page: 1, total_selector: "//openSearch:totalResults"
The above example will execute the following requests:
http://gdata.youtube.com/feeds/api/videos?start-index=1&max-results=50
http://gdata.youtube.com/feeds/api/videos?start-index=51&max-results=50
http://gdata.youtube.com/feeds/api/videos?start-index=101&max-results=50
etc.
Tokenised pagination
Tokenised pagination is enabled for XML, OAI, and JSON parser scripts.
base_url "http://gdata.youtube.com/feeds/api/videos"
paginate type: 'token', next_page_token_location: "$.nextPageToken", per_page: 1, per_page_parameter: "maxResults", page_parameter: 'pageToken'
The total_selector option is used to extract the total number of records so that the paginator knows when to stop. For tokenised pagination this is optional; without it, the harvest will stop when the next_page_token is not present.
The type option refers to whether the pagination is implemented by specifying a starting index, as in the example above, in which case you use “item”, or by specifying an actual page number, in which case you use “page”. Use type: 'tokenised' for tokenised pagination.
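For an API that is paginated by page number, the configuration might look something like this (a sketch; the parameter values and total selector below are illustrative, not from a real endpoint):
paginate page_parameter: "page", type: "page", per_page_parameter: "per_page", per_page: 100, page: 1, total_selector: "$.meta.total"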
For APIs that require an initial parameter for the first tokenised paginated request, but not for successive requests, you can use the initial_param option, e.g. initial_param: 'cursor=*'.
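A sketch of what a tokenised configuration with initial_param might look like (the parameter names and token location below are illustrative):
paginate type: 'token', next_page_token_location: "$.nextCursor", per_page: 100, per_page_parameter: 'rows', page_parameter: 'cursor', initial_param: 'cursor=*'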
Scroll Harvest
You can harvest from an Elasticsearch scroll endpoint by providing paginate type: "scroll"
in your parser. The worker expects all requests to be GET requests, and it stops when no more results are returned by the API. The results are expected to be in the JSON body under body['hits']['hits'].
To generate the next URL the worker will convert the base URL (e.g. https://content-partner.com/search/collection/_search?scroll=10m&q=334) into the format https://content-partner.com/search/_search/scroll/<scroll_id>?scroll=10m&q=334, pulling the scroll id from the body['_scroll_id'] key in the response. Here is an example:
paginate type: "scroll", duration_parameter: 'scroll', duration_value: '1m'
If your content partner is doing something different, you can override the way the worker generates the next scroll URL and how it determines whether there are more results by providing the respective blocks.
Here is an example:
paginate type: "scroll",
next_scroll_url_block: proc { |url, klass| url.match('(?<base_url>.+\/collection)')[:base_url] + klass._document.headers[:location] },
scroll_more_results_block: proc { |klass| klass._document.code == 303 }
The next_scroll_url_block tells the worker how to construct the next scroll URL and where the next scroll token comes from. The scroll_more_results_block tells the worker when to keep going and fetch more results.
Reject records
reject_if
The operator can use the reject_if directive to reject records which match the criteria specified.
reject_if do
get(:title).find_with(/Weekly Review/).present?
end
Delete records
delete_if
The operator can use the delete_if directive to mark records which match the specified criteria as deleted in the API.
delete_if do
get(:title).find_with(/Weekly Review/).present?
end
When the block of code returns a true value the record will be marked as deleted in the API. In the particular case above all records with a title that matches “Weekly Review” will be deleted.
Throttling
throttle
To throttle the requests to a specific host add a throttle directive to the parser configuration.
class TestParser < HarvesterCore::Xml::Base
throttle :host => "gdata.youtube.com", :delay => 10
end
The effect of the throttle directive above will be that requests to the gdata.youtube.com host will be at least 10 seconds apart from each other.
Fractions of a second are also allowed for finer grained control.
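For example, a half-second delay between requests to a host might look like this (the host below is illustrative):
throttle :host => "api.example.org", :delay => 0.5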
Request timeout
request_timeout
To change the length of time that the harvester will wait before timing out a request you can do the following:
class TestParser < HarvesterCore::Xml::Base
request_timeout 60000
end
Note: The time is in milliseconds, so the above example will set a request timeout of 60 seconds.
Harvesting non-primary sources
priority
By defining a priority, the harvest operator indicates that this parser will not overwrite the primary source, but will create an additional source on the record with a matching internal identifier. The priority is a positive or negative integer, a value of 0 will disable this feature (overwrite the primary source).
priority 5
Unintuitively, sources with more negative priorities are used first, and sources with more positive priorities are used last. In other words, a numerically higher priority value means a lower effective priority.
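For example (the values below are illustrative):
priority -5 # used before the primary source (priority 0)
priority 10 # used after the primary source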
Concepts
Matching
To specify the type of concept matching required, set the match_concepts directive as below.
match_concepts :create_or_update
It can be set to:
:create
which doesn't perform any matching, always creating new concepts
:match
which will only match (and update the sameAs field on the matching concept with data from this harvest)
:create_or_update
which will match the concept if it exists, otherwise create a new concept.
XML Specific
Record Selector
When the source of the records is in one big XML file it is necessary to split the document into XML fragments that represent each record.
class TestParser < HarvesterCore::Xml::Base
record_selector "//item"
end
The “//item” string is the XPath expression used to split the XML file into records.
Sitemap Entry Selector
In cases where the base_url points to a sitemap, the operator can specify the xpath that matches each URL in the sitemap.
class TestParser < HarvesterCore::Xml::Base
sitemap_entry_selector "//loc"
end
The most common use case for this is sitemaps, and the typical node where a sitemap stores each URL is a <loc> node.
Record Format
When harvesting from a sitemap the most common case is that the resources specified in the sitemap are HTML web pages, but there are cases when they are XML.
For these cases it is necessary to specify the format of the web resources.
class NzOnScreen < HarvesterCore::Xml::Base
sitemap_entry_selector "//loc"
record_format :xml
end
Multiple Records from multiple Sitemap entries
Sometimes a sitemap_entry_selector will point to a location that contains multiple records. In this case you will need to use all three of the above directives in conjunction with each other, for example:
class NzOnScreen < HarvesterCore::Xml::Base
sitemap_entry_selector "//loc"
record_selector "//feed//entry"
record_format :xml
end
In this instance the sitemap entries are <loc> nodes. At each location specified by a sitemap entry there are multiple record entries, which can be selected by specifying the record_selector. In this instance the expected format of the record entries is XML.
Namespaces
See XML Namespaces – link to namespaces page soon
The harvest operator can define namespaces at the class level. This allows the operator to specify which namespace to use for specific XPath queries.
class NzOnScreen < HarvesterCore::Xml::Base
namespaces dc: "http://purl.org/dc/elements/1.1/"
attribute :category, xpath: "//dc:identifier"
end
Node
Sometimes the data will contain groups of tags which relate to each other. For example a number of authors, each with a first name and last name. In this case the harvest operator can select the group node and then access the children in a block.
attribute :contributor do
  contributor = get(:contributor)
  node("//person").each do |node|
    contributor += compose(node.xpath("first-name").text, " ", node.xpath("last-name").text)
  end
  contributor
end
If you want to use predefined namespaces inside a node block, you will need to pass self._namespaces to all your xpath method calls. This is because within the node block the node is just a Nokogiri element.
node("./metadata/m:record/m:datafield[@tag='260']").each do |node|
text << node.xpath("./m:subfield", self._namespaces).map(&:text).join("; ")
end
OAI Specific
By default, the OAI implementation uses &metadataPrefix=oai_dc with no set value, but these can be configured on a per-parser basis, as below:
MetadataPrefix
base_url "http://emu.tepapa.govt.nz/oai/oai.php"
metadata_prefix 'oai_dc_mm'
Set
base_url "http://emu.tepapa.govt.nz/oai/oai.php"
set 'CEISMIC'
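Both can be set in the same parser; a minimal sketch, assuming an OAI base class named HarvesterCore::Oai::Base (the class name is an assumption, following the naming of the XML and JSON base classes):
class TePapaOai < HarvesterCore::Oai::Base
base_url "http://emu.tepapa.govt.nz/oai/oai.php"
metadata_prefix 'oai_dc_mm'
set 'CEISMIC'
end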
JSON Specific
Record Selector
When there are many records in one JSON file, it is necessary to select the array containing records.
class TestParser < HarvesterCore::Json::Base
record_selector "$.items"
end
This is a JsonPath expression that selects the array. If the root of the file is an array, you would use "$".
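For example, when the root of the file is itself the array of records:
record_selector "$"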
Paths
Attributes are selected within each record using JsonPath
attribute :title, path: "$.title"
attribute :author, path: "$.author.name"
See http://goessner.net/articles/JsonPath/ for more details on JsonPath.
Note: JSON key names containing special characters such as : should be surrounded in 'single quotes', even if they are the only thing in the path. For example:
attribute :title, path: "'dc:title'"
# or
attribute :title, path: "$.'dc:title'"
Preprocess source data before running the parser script
This optional block allows manipulation of the response data from your harvest source before it is handed on to the rest of the parser as normal. It can be used for any kind of pre-processing or data clean-up, but was initially designed to rationalise verbose feeds that mentioned items multiple times, keeping only the latest mention to be harvested.
JSON example
pre_process_block do |rest_client_response|
  # Convert the RestClient::Response body to a Ruby data structure
  hash = JSON.parse(rest_client_response.body)
  # Sorting and uniq-ing the data keeps only the latest mention of each item
  hash = hash.sort do |item_a, item_b|
    # 'updated_at' specifies the date to sort on
    Date.parse(item_b['updated_at']) <=> Date.parse(item_a['updated_at'])
  end.uniq { |item| item['audio_id'] } # 'audio_id' specifies the unique item ID to rationalise with
  # Convert back to JSON
  json = hash.to_json
  # Return a new RestClient::Response with the mutated JSON
  RestClient::Response.create(json, rest_client_response.net_http_res, rest_client_response.request)
end
XML example
pre_process_block do |rest_client_response|
  # Convert RestClient::Response to a Nokogiri document
  doc = Nokogiri::XML(rest_client_response.body) { |config| config.options = Nokogiri::XML::ParseOptions::NOBLANKS }
  # Select the node that contains all items
  items_node = doc.at_xpath('//dnz-export')
  # Sort by the "date" field, latest first
  sorted = items_node.children.sort_by do |item|
    item.children.find { |child| child.name == 'date' }.text
  end.reverse!
  # uniq keeps only the latest mention of each item, based on the unique ID of that item (specified in "key")
  uniq = sorted.uniq do |item|
    item.children.find { |child| child.name == 'key' }.text
  end
  # Replace all children with the new values
  items_node.children.remove
  uniq.each { |n| items_node << n }
  # Return a new rest client response
  RestClient::Response.create(doc.to_xml, rest_client_response.net_http_res, rest_client_response.request)
end