Improve overcrawl tracking for BioCASe datasets #24

@MattBlissett

If you put this in a file count_request.xml:

<?xml version='1.0' encoding='UTF-8'?>
<request xmlns="http://www.biocase.org/schemas/protocol/1.3">
        <header><type>search</type></header>
        <search>
                <requestFormat>http://www.tdwg.org/schemas/abcd/2.06</requestFormat>
                <responseFormat start='0' limit='10'>http://www.tdwg.org/schemas/abcd/2.06</responseFormat>
                <filter>
                        <like path='/DataSets/DataSet/Metadata/Description/Representation/Title'>University of Vienna, Institute for Botany - Herbarium WU</like>
                </filter>
                <count>true</count>
        </search>
</request>

and then query the appropriate BioCASe endpoint:

curl --data-urlencode query@count_request.xml 'http://131.130.131.9/biocase/pywrapper.cgi?dsa=gbif_wu'

The response gives us a record count:

  <biocase:content recordCount="0" recordDropped="0" recordStart="0" totalSearchHits="0">
    <biocase:count>107838</biocase:count>
  </biocase:content>
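
The same request can be scripted, which would be needed if this check ran for many datasets. The following is only a sketch using the Python standard library: the function names `parse_declared_count` and `fetch_declared_count` are hypothetical, and it assumes the response always carries a single `count` element shaped like the one shown above.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def parse_declared_count(response_xml):
    """Return the integer inside the first <count> element, ignoring namespaces."""
    root = ET.fromstring(response_xml)
    for element in root.iter():
        # ElementTree renders qualified tags as "{namespace-uri}localname".
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "count" and element.text:
            return int(element.text.strip())
    raise ValueError("no <count> element found in response")


def fetch_declared_count(endpoint, request_path):
    """POST the request file as the form field 'query', like curl --data-urlencode."""
    with open(request_path, encoding="utf-8") as handle:
        body = urllib.parse.urlencode({"query": handle.read()}).encode("ascii")
    # Supplying a request body makes urlopen issue a POST.
    with urllib.request.urlopen(endpoint, data=body, timeout=60) as response:
        return parse_declared_count(response.read())
```

Calling `fetch_declared_count('http://131.130.131.9/biocase/pywrapper.cgi?dsa=gbif_wu', 'count_request.xml')` should then give the same number as the curl command, provided the endpoint is reachable.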

It's way off from the 155141 records we have in GBIF, so perhaps this dataset is very overingested: https://management-tools.gbif.org/crawl-history?uuid=0afba960-be3b-4202-a7de-736ae05aec9e

{
  "datasetKey" : "0afba960-be3b-4202-a7de-736ae05aec9e",
  "lastCrawlId" : 34,
  "lastCrawlCount" : 80655,
  "recordCount" : 155141,
  "lastCrawlFragmentEmittedCount" : 104005,
  "finishReason" : "NORMAL",
  "processStateOccurrence" : "RUNNING",
  "crawlInfo" : [ {
    "crawlId" : 3,
    "count" : 1046
  }, {
    "crawlId" : 4,
    "count" : 4
  }, {
    "crawlId" : 5,
    "count" : 2
  }, {
    "crawlId" : 8,
    "count" : 1
  }, {
    "crawlId" : 9,
    "count" : 2
  }, {
    "crawlId" : 10,
    "count" : 2
  }, {
    "crawlId" : 11,
    "count" : 29
  }, {
    "crawlId" : 12,
    "count" : 29381
  }, {
    "crawlId" : 13,
    "count" : 10192
  }, {
    "crawlId" : 14,
    "count" : 33069
  }, {
    "crawlId" : 15,
    "count" : 1
  }, {
    "crawlId" : 16,
    "count" : 58
  }, {
    "crawlId" : 17,
    "count" : 2
  }, {
    "crawlId" : 18,
    "count" : 2
  }, {
    "crawlId" : 20,
    "count" : 268
  }, {
    "crawlId" : 21,
    "count" : 1
  }, {
    "crawlId" : 22,
    "count" : 111
  }, {
    "crawlId" : 23,
    "count" : 27
  }, {
    "crawlId" : 24,
    "count" : 134
  }, {
    "crawlId" : 25,
    "count" : 123
  }, {
    "crawlId" : 26,
    "count" : 3
  }, {
    "crawlId" : 30,
    "count" : 17
  }, {
    "crawlId" : 31,
    "count" : 1
  }, {
    "crawlId" : 33,
    "count" : 10
  }, {
    "crawlId" : 34,
    "count" : 80655
  } ],
  "percentagePreviousCrawls" : 48.011808612810285
}
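
To make the comparison concrete, here is a rough sketch of how the crawl history could be checked against the provider's count. It assumes each `crawlInfo` entry is the number of records last seen in that crawl (the entries above do add up to `recordCount`), and the function name `overcrawl_summary` and its output keys are made up for illustration.

```python
def overcrawl_summary(history, declared_count):
    """Compare a crawl-history document with the count declared by the provider."""
    # Map each crawl id to the number of records attributed to it.
    counts = {entry["crawlId"]: entry["count"] for entry in history["crawlInfo"]}
    total = sum(counts.values())
    # Records that the most recent crawl did not touch.
    previous = total - counts.get(history["lastCrawlId"], 0)
    return {
        "total": total,
        "fromPreviousCrawls": previous,
        "percentagePreviousCrawls": 100.0 * previous / total if total else 0.0,
        "surplusOverDeclared": total - declared_count,
    }
```

For the history above and a declared count of 107838, this gives a total of 155141, of which 74486 (about 48.01%) come from crawls before crawl 34, and a surplus of 47303 records over what the provider reports.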

A different dataset always returns less than the search.

This could belong in the crawler as a core feature. We have a mostly-unused "declaredCount" field in the registry with a related purpose, but I think it hasn't been implemented as it's very unreliable. Maybe it would be better as a tool on the management console.

@gbif/content can think about it.
