Improve overcrawl tracking for BioCASe datasets

Put this in a file `count_request.xml`:

```xml
<?xml version='1.0' encoding='UTF-8'?>
<request xmlns="http://www.biocase.org/schemas/protocol/1.3">
        <header><type>search</type></header>
        <search>
                <requestFormat>http://www.tdwg.org/schemas/abcd/2.06</requestFormat>
                <responseFormat start='0' limit='10'>http://www.tdwg.org/schemas/abcd/2.06</responseFormat>
                <filter>
                        <like path='/DataSets/DataSet/Metadata/Description/Representation/Title'>University of Vienna, Institute for Botany - Herbarium WU</like>
                </filter>
                <count>true</count>
        </search>
</request>
```

and then query the appropriate BioCASe endpoint:

```shell
curl --data-urlencode query@count_request.xml 'http://131.130.131.9/biocase/pywrapper.cgi?dsa=gbif_wu'
```

The response gives us a record count:

```xml
  <biocase:content recordCount="0" recordDropped="0" recordStart="0" totalSearchHits="0">
    <biocase:count>107838</biocase:count>
  </biocase:content>
```
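The same request and count extraction could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, the `biocase:response` wrapper and its namespace URI in the sample are assumptions (the fragment above does not show them), and `fetch_provider_count` needs network access, so it is defined but not called here.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import Union


def encode_query(request_xml: str) -> bytes:
    """Form-encode the request as `query=<xml>`, like `curl --data-urlencode query@file`."""
    return urllib.parse.urlencode({"query": request_xml}).encode("ascii")


def extract_count(response_xml: Union[str, bytes]) -> int:
    """Return the integer in the first element whose local name is `count`."""
    root = ET.fromstring(response_xml)
    for element in root.iter():
        # Tags look like "{namespace}count"; drop the namespace part.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "count" and element.text:
            return int(element.text.strip())
    raise ValueError("response contains no count element")


def fetch_provider_count(endpoint: str, request_xml: str, timeout: float = 60.0) -> int:
    """POST the count request to a BioCASe endpoint and parse the reply (hypothetical helper)."""
    body = encode_query(request_xml)
    with urllib.request.urlopen(endpoint, data=body, timeout=timeout) as reply:
        return extract_count(reply.read())


# Assumption: the content fragment sits inside a root element declaring the prefix.
SAMPLE_RESPONSE = """
<biocase:response xmlns:biocase="http://www.biocase.org/schemas/protocol/1.3">
  <biocase:content recordCount="0" recordDropped="0" recordStart="0" totalSearchHits="0">
    <biocase:count>107838</biocase:count>
  </biocase:content>
</biocase:response>
"""

print(extract_count(SAMPLE_RESPONSE))  # prints 107838
```

Matching on the local name rather than a fixed namespace keeps the parser working if a provider answers with a different protocol version.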

At 107,838 it's way off from the 155,141 records we have in GBIF, so perhaps this dataset is heavily over-ingested: https://management-tools.gbif.org/crawl-history?uuid=0afba960-be3b-4202-a7de-736ae05aec9e

```json
{
  "datasetKey" : "0afba960-be3b-4202-a7de-736ae05aec9e",
  "lastCrawlId" : 34,
  "lastCrawlCount" : 80655,
  "recordCount" : 155141,
  "lastCrawlFragmentEmittedCount" : 104005,
  "finishReason" : "NORMAL",
  "processStateOccurrence" : "RUNNING",
  "crawlInfo" : [ {
    "crawlId" : 3,
    "count" : 1046
  }, {
    "crawlId" : 4,
    "count" : 4
  }, {
    "crawlId" : 5,
    "count" : 2
  }, {
    "crawlId" : 8,
    "count" : 1
  }, {
    "crawlId" : 9,
    "count" : 2
  }, {
    "crawlId" : 10,
    "count" : 2
  }, {
    "crawlId" : 11,
    "count" : 29
  }, {
    "crawlId" : 12,
    "count" : 29381
  }, {
    "crawlId" : 13,
    "count" : 10192
  }, {
    "crawlId" : 14,
    "count" : 33069
  }, {
    "crawlId" : 15,
    "count" : 1
  }, {
    "crawlId" : 16,
    "count" : 58
  }, {
    "crawlId" : 17,
    "count" : 2
  }, {
    "crawlId" : 18,
    "count" : 2
  }, {
    "crawlId" : 20,
    "count" : 268
  }, {
    "crawlId" : 21,
    "count" : 1
  }, {
    "crawlId" : 22,
    "count" : 111
  }, {
    "crawlId" : 23,
    "count" : 27
  }, {
    "crawlId" : 24,
    "count" : 134
  }, {
    "crawlId" : 25,
    "count" : 123
  }, {
    "crawlId" : 26,
    "count" : 3
  }, {
    "crawlId" : 30,
    "count" : 17
  }, {
    "crawlId" : 31,
    "count" : 1
  }, {
    "crawlId" : 33,
    "count" : 10
  }, {
    "crawlId" : 34,
    "count" : 80655
  } ],
  "percentagePreviousCrawls" : 48.011808612810285
}
```
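To put numbers on the gap, here is a small sketch that combines the crawl-history fields with the provider's count. It assumes `recordCount` is the number of records currently indexed and `lastCrawlCount` is the number touched by the latest crawl; the per-crawl counts above do sum to `recordCount`, which is consistent with that reading. The function name and result keys are invented for illustration.

```python
def overcrawl_summary(history: dict, provider_count: int) -> dict:
    """Compare indexed records with the provider's count (hypothetical helper)."""
    record_count = history["recordCount"]
    last_crawl_count = history["lastCrawlCount"]
    # Records last seen in an earlier crawl, i.e. not touched by the latest one.
    from_previous_crawls = record_count - last_crawl_count
    return {
        "excess_over_provider": record_count - provider_count,
        "from_previous_crawls": from_previous_crawls,
        "percentage_previous_crawls": 100 * from_previous_crawls / record_count,
    }


# Top-level values copied from the crawl history above.
history = {"recordCount": 155141, "lastCrawlCount": 80655}

summary = overcrawl_summary(history, provider_count=107838)
print(summary["excess_over_provider"])   # prints 47303
print(summary["from_previous_crawls"])   # prints 74486
```

The computed percentage agrees with the `percentagePreviousCrawls` value reported by the tool (about 48.01).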

A [different dataset](https://management-tools.gbif.org/crawl-history?uuid=9666593a-f762-11e1-a439-00145eb45e9a) always returns fewer records than the count search reports.

This could be a core part of the crawler. We have a mostly-unused "declaredCount" field in the registry with a related purpose, but I think it hasn't been implemented because the count is very unreliable. Maybe it would be better as a tool on the management console.
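As a starting point for such a tool, the check could be as simple as the sketch below. Everything here is hypothetical: the function name, the labels and the 5% tolerance are placeholders, and because provider counts are unreliable a missing or non-positive count is reported as "unknown" rather than treated as an error.

```python
from typing import Optional


def classify_dataset(indexed_count: int,
                     provider_count: Optional[int],
                     tolerance: float = 0.05) -> str:
    """Label a dataset by comparing indexed records with the provider's count."""
    if provider_count is None or provider_count <= 0:
        # No usable count from the provider, so make no claim either way.
        return "unknown"
    ratio = indexed_count / provider_count
    if ratio > 1 + tolerance:
        return "overcrawled"
    if ratio < 1 - tolerance:
        return "undercrawled"
    return "ok"


# Numbers from the Herbarium WU example above.
print(classify_dataset(155141, 107838))  # prints overcrawled
```

A tolerance band avoids flagging datasets whose provider count drifts slightly between the count request and the crawl.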

@gbif/content can think about it.


Improve overcrawl tracking for BioCASe datasets #24
