If you put this in a file count_request.xml:
<?xml version='1.0' encoding='UTF-8'?>
<request xmlns="http://www.biocase.org/schemas/protocol/1.3">
<header><type>search</type></header>
<search>
<requestFormat>http://www.tdwg.org/schemas/abcd/2.06</requestFormat>
<responseFormat start='0' limit='10'>http://www.tdwg.org/schemas/abcd/2.06</responseFormat>
<filter>
<like path='/DataSets/DataSet/Metadata/Description/Representation/Title'>University of Vienna, Institute for Botany - Herbarium WU</like>
</filter>
<count>true</count>
</search>
</request>
and then query the appropriate BioCASe endpoint:
curl --data-urlencode query@count_request.xml 'http://131.130.131.9/biocase/pywrapper.cgi?dsa=gbif_wu'
The response gives us a record count:
<biocase:content recordCount="0" recordDropped="0" recordStart="0" totalSearchHits="0">
<biocase:count>107838</biocase:count>
</biocase:content>
It's way off from what we have in GBIF, so perhaps this dataset is heavily over-ingested: https://management-tools.gbif.org/crawl-history?uuid=0afba960-be3b-4202-a7de-736ae05aec9e
{
"datasetKey" : "0afba960-be3b-4202-a7de-736ae05aec9e",
"lastCrawlId" : 34,
"lastCrawlCount" : 80655,
"recordCount" : 155141,
"lastCrawlFragmentEmittedCount" : 104005,
"finishReason" : "NORMAL",
"processStateOccurrence" : "RUNNING",
"crawlInfo" : [ {
"crawlId" : 3,
"count" : 1046
}, {
"crawlId" : 4,
"count" : 4
}, {
"crawlId" : 5,
"count" : 2
}, {
"crawlId" : 8,
"count" : 1
}, {
"crawlId" : 9,
"count" : 2
}, {
"crawlId" : 10,
"count" : 2
}, {
"crawlId" : 11,
"count" : 29
}, {
"crawlId" : 12,
"count" : 29381
}, {
"crawlId" : 13,
"count" : 10192
}, {
"crawlId" : 14,
"count" : 33069
}, {
"crawlId" : 15,
"count" : 1
}, {
"crawlId" : 16,
"count" : 58
}, {
"crawlId" : 17,
"count" : 2
}, {
"crawlId" : 18,
"count" : 2
}, {
"crawlId" : 20,
"count" : 268
}, {
"crawlId" : 21,
"count" : 1
}, {
"crawlId" : 22,
"count" : 111
}, {
"crawlId" : 23,
"count" : 27
}, {
"crawlId" : 24,
"count" : 134
}, {
"crawlId" : 25,
"count" : 123
}, {
"crawlId" : 26,
"count" : 3
}, {
"crawlId" : 30,
"count" : 17
}, {
"crawlId" : 31,
"count" : 1
}, {
"crawlId" : 33,
"count" : 10
}, {
"crawlId" : 34,
"count" : 80655
} ],
"percentagePreviousCrawls" : 48.011808612810285
}
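The numbers above can be cross-checked with a few lines of Python. This is only a rough sketch: the values are copied from the count response and the crawl history, and reading "lastCrawlCount" as the number of records touched by the most recent crawl is my assumption about what the field means.

```python
# Values copied from the count response and the crawl history above.
provider_count = 107838    # <biocase:count> reported by the provider
record_count = 155141      # "recordCount" in the crawl history
last_crawl_count = 80655   # "lastCrawlCount" (crawlId 34)

# Records whose most recent crawl is older than the last one
# (assumed meaning of the per-crawl counts in "crawlInfo").
from_previous_crawls = record_count - last_crawl_count
percentage_previous = 100 * from_previous_crawls / record_count

# How far the GBIF total exceeds what the provider reports.
surplus = record_count - provider_count

print(f"records from previous crawls: {from_previous_crawls} ({percentage_previous:.2f}%)")
print(f"surplus over provider count:  {surplus}")
```

The computed percentage agrees with "percentagePreviousCrawls", and the per-crawl counts in "crawlInfo" add up to "recordCount", so the fields are at least internally consistent.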
For a different dataset, the crawl always returns fewer records than the count search reports.
This could belong in the core of the crawler. We have a mostly unused "declaredCount" field in the registry with a related purpose, but I think it hasn't been implemented because it's very unreliable. Maybe it would be better as a tool on the management console.
@gbif/content can think about it.
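To make the idea more concrete, here is a minimal sketch of what such a check might look like. Everything in it is hypothetical: the function names, the 5% tolerance, and the wrapping `biocase:response` element with its namespace declaration in the sample are assumptions for illustration, not existing crawler or registry code.

```python
import xml.etree.ElementTree as ET


def parse_biocase_count(response_xml: str) -> int:
    """Return the integer in the first <count> element, ignoring namespaces."""
    root = ET.fromstring(response_xml)
    for element in root.iter():
        # Parsed tags look like "{namespace}count"; compare the local name only.
        if element.tag.rsplit("}", 1)[-1] == "count":
            return int((element.text or "").strip())
    raise ValueError("no <count> element found in response")


def looks_overingested(provider_count: int, indexed_count: int,
                       tolerance: float = 0.05) -> bool:
    """True if the indexed total exceeds the provider count by more than `tolerance`.

    The 5% default is an arbitrary placeholder, not a recommended threshold.
    """
    return indexed_count > provider_count * (1 + tolerance)


# Sample response; the root element and namespace URI are assumed here.
sample = """<biocase:response xmlns:biocase="http://www.biocase.org/schemas/protocol/1.3">
<biocase:content recordCount="0" recordDropped="0" recordStart="0" totalSearchHits="0">
<biocase:count>107838</biocase:count>
</biocase:content>
</biocase:response>"""

provider_count = parse_biocase_count(sample)
print(provider_count, looks_overingested(provider_count, 155141))
```

A tolerance is used rather than strict equality because provider counts can be unreliable, so small differences probably should not be flagged.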