Merged
57 commits
d32ec9f
feat: decoupled prometheus exporter's calculation and output
SkyeYoung Jun 26, 2025
35a38ad
fix: require process
SkyeYoung Jun 27, 2025
55619cf
fix
SkyeYoung Jun 27, 2025
467363a
fix: part of exportor logic
SkyeYoung Jun 27, 2025
e34a924
rewrite: use FFI to get nginx_status
SkyeYoung Jun 27, 2025
d91509b
chore
SkyeYoung Jun 27, 2025
d2b3bd4
Merge remote-tracking branch 'upstream/master' into young/perf/promet…
SkyeYoung Jun 27, 2025
b22e74f
chore: simplify logic
SkyeYoung Jun 27, 2025
27eec34
chore: rm useless logic
SkyeYoung Jun 27, 2025
a42006f
fix: lint
SkyeYoung Jun 27, 2025
740c39b
fix: lint
SkyeYoung Jun 27, 2025
e4800f5
fix: try init cached metrics to pass tests
SkyeYoung Jul 7, 2025
97e00e7
try fix logic
SkyeYoung Jul 7, 2025
171406d
fix: init only one time
SkyeYoung Jul 7, 2025
57959e6
fix: lint
SkyeYoung Jul 8, 2025
a22b9a7
chore: try fix tests
SkyeYoung Jul 8, 2025
27ac5c9
test: fix prometheus related cases
SkyeYoung Jul 9, 2025
1fb6217
Merge remote-tracking branch 'upstream/master' into young/perf/promet…
SkyeYoung Jul 9, 2025
521f4c9
fix: logic
SkyeYoung Jul 9, 2025
226a14c
Revert "fix: logic"
SkyeYoung Jul 9, 2025
f35b711
fix: init etcd logic
SkyeYoung Jul 17, 2025
df84c76
Merge remote-tracking branch 'upstream/master' into young/perf/promet…
SkyeYoung Jul 17, 2025
b998b64
fix: logic
SkyeYoung Jul 17, 2025
0d61202
fix
SkyeYoung Jul 17, 2025
3601800
fix: test case
SkyeYoung Jul 18, 2025
9523fed
fix: test cases
SkyeYoung Jul 18, 2025
6ae0448
test: improve stability
SkyeYoung Jul 18, 2025
005cc83
chore(ngx_tpl): rm prom privileged_agent conf
SkyeYoung Jul 18, 2025
7a06f5c
test: use preprocessor to add refresh_interval config
SkyeYoung Jul 18, 2025
a7dcd0e
fix
SkyeYoung Jul 18, 2025
6b8d544
fix
SkyeYoung Jul 18, 2025
a68f444
test: add preprocessor, rm useless case
SkyeYoung Jul 18, 2025
e89de27
feat: use a separate shared dict cache
SkyeYoung Jul 21, 2025
5167098
chore: add comment
SkyeYoung Jul 21, 2025
3364351
feat: add fallback err msg when data not exsits
SkyeYoung Jul 21, 2025
77e4132
fix: prometheus init logic
SkyeYoung Jul 21, 2025
a25069e
fix: workaround
SkyeYoung Jul 21, 2025
bb845a0
fix
SkyeYoung Jul 21, 2025
6fb9bf9
fix: rm sleep
SkyeYoung Jul 22, 2025
5395418
fix: tests
SkyeYoung Jul 22, 2025
fe01d03
fix: tests
SkyeYoung Jul 22, 2025
287a176
fix: tests
SkyeYoung Jul 22, 2025
db209c3
fix: tests
SkyeYoung Jul 22, 2025
2023022
fix: stream plugin prometheus
SkyeYoung Jul 22, 2025
508655a
fix: workaround
SkyeYoung Jul 22, 2025
9bf5b5e
fix: xrpc test
SkyeYoung Jul 22, 2025
85eafa5
fix: tests
SkyeYoung Jul 22, 2025
72c6f22
fix: lint
SkyeYoung Jul 22, 2025
9d5e45c
fix: add usefull comments and logs
SkyeYoung Jul 22, 2025
61f4812
chore: add comments
SkyeYoung Jul 22, 2025
cf66013
fix: standardized error return
SkyeYoung Jul 22, 2025
6de5129
fix: log forciable
SkyeYoung Jul 23, 2025
0b6f1a9
fix: return and log shared.DICT.get err
SkyeYoung Jul 23, 2025
432af12
fix: add more explicit error messages when ngx.shared.DICT.set fails
SkyeYoung Jul 23, 2025
4d77fc2
fix: use ngx.timer.every to reduce potential problems
SkyeYoung Jul 23, 2025
cd96ecd
fix: add cache exptime
SkyeYoung Jul 23, 2025
f28295d
chore: rm blank line
SkyeYoung Jul 23, 2025
4 changes: 3 additions & 1 deletion apisix/cli/config.lua
@@ -95,6 +95,7 @@ local _M = {
   meta = {
     lua_shared_dict = {
       ["prometheus-metrics"] = "15m",
+      ["prometheus-cache"] = "10m",
       ["standalone-config"] = "10m",
       ["status-report"] = "1m",
     }
@@ -323,7 +324,8 @@ local _M = {
       export_addr = {
         ip = "127.0.0.1",
         port = 9091
-      }
+      },
+      refresh_interval = 15
     },
     ["server-info"] = {
       report_ttl = 60
21 changes: 9 additions & 12 deletions apisix/cli/ngx_tpl.lua
@@ -67,6 +67,9 @@ lua {
     {% if enabled_stream_plugins["prometheus"] then %}
     lua_shared_dict prometheus-metrics {* meta.lua_shared_dict["prometheus-metrics"] *};
     {% end %}
+    {% if enabled_plugins["prometheus"] or enabled_stream_plugins["prometheus"] then %}
+    lua_shared_dict prometheus-cache {* meta.lua_shared_dict["prometheus-cache"] *};
+    {% end %}
     {% if standalone_with_admin_api then %}
     lua_shared_dict standalone-config {* meta.lua_shared_dict["standalone-config"] *};
     {% end %}
@@ -96,22 +99,20 @@ http {
     }

     init_worker_by_lua_block {
-        require("apisix.plugins.prometheus.exporter").http_init(true)
+        local prometheus = require("apisix.plugins.prometheus.exporter")
+        prometheus.http_init(true)
+        prometheus.init_exporter_timer()
     }

     server {
-    {% if use_apisix_base then %}
Member Author

Now this API can run in a normal worker.

Contributor

The export API is no longer served by the privileged agent process, which isolates HTTP traffic from root privileges and improves security.
Therefore this is no longer needed.

-        listen {* prometheus_server_addr *} enable_process=privileged_agent;
-    {% else %}
-        listen {* prometheus_server_addr *};
-    {% end %}
+        listen {* prometheus_server_addr *};

         access_log off;

         location / {
             content_by_lua_block {
                 local prometheus = require("apisix.plugins.prometheus.exporter")
-                prometheus.export_metrics(true)
+                prometheus.export_metrics()
             }
         }

@@ -577,11 +578,7 @@ http {

     {% if enabled_plugins["prometheus"] and prometheus_server_addr then %}
     server {
-    {% if use_apisix_base then %}
-        listen {* prometheus_server_addr *} enable_process=privileged_agent;
-    {% else %}
-        listen {* prometheus_server_addr *};
-    {% end %}
+        listen {* prometheus_server_addr *};

         access_log off;

2 changes: 1 addition & 1 deletion apisix/core/config_etcd.lua
@@ -1001,7 +1001,7 @@ function _M.new(key, opts)
         sync_times = 0,
         running = true,
         conf_version = 0,
-        values = nil,
+        values = {},
Member Author

@SkyeYoung SkyeYoung Jul 18, 2025

Without this change, the following failure occurs:

Failed test 't/core/config_etcd.t TEST 12: test route with special character "-" - pattern "[error]" should not match any line in error.log but matches line "2025/07/09 14:24:01 [error] 1169428#1169428: *26 [lua] exporter.lua:534: Failed to collect metrics: 

/home/xxx/apisix/apisix/consumer.lua:77: attempt to get length of local 'data_list' (a nil value), context: ngx.timer" (req 0)

In practice, even when plugin.init_prometheus() is placed in init.lua after all the other init_worker calls, the relevant values may still be nil, as the error above shows. The values therefore need a usable default before initialization completes.

After the discussion with yuansheng and bzp a few days ago, there are two ways to handle it. One is the current change; the other is as follows:

-- add between https://github.com/SkyeYoung/apisix/blob/7a06f5c2f7b74cf621a0c2ff6878e8aa0e4e99a7/apisix/core/config_etcd.lua#L1030-L1031
else
  load_full_data(obj, { nodes = {} }, nil)
  self.need_reload = true
-- ^ This is why this method is not used:
-- there is a hidden variable in `load_full_data` that needs to be reset

Without resetting this variable, etcd-sync.t TEST 5 fails, because the subsequent execution of load_full_data does not happen.

Thanks to @bzp2010 for the help; I was lost in the code.
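
A minimal sketch of the failure mode described above, in plain Lua. `count_items` and the tables passed to it are hypothetical stand-ins, not APISIX code; the only point is that taking the length of a nil `values` field raises an error, while an empty-table default does not.

```lua
-- Hypothetical stand-in for code that reads `values` from a config object
-- before the first etcd sync has completed.
local function count_items(obj)
    local data_list = obj.values
    return #data_list  -- raises an error when data_list is nil
end

-- Old default: values = nil
local ok_nil, err = pcall(count_items, { values = nil })

-- New default: values = {}
local ok_empty, n = pcall(count_items, { values = {} })

print(ok_nil, err)
print(ok_empty, n)
```

With the empty-table default, a timer that fires early sees zero items instead of failing, which is presumably why the one-line change is enough to keep the test's error.log clean.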

         need_reload = true,
         watching_stream = nil,
         routes_hash = nil,
4 changes: 4 additions & 0 deletions apisix/init.lua
@@ -155,6 +155,10 @@ function _M.http_init_worker()
     if local_conf.apisix and local_conf.apisix.enable_server_tokens == false then
         ver_header = "APISIX"
     end
+
+    -- To ensure that all workers related to Prometheus metrics are initialized,
+    -- we need to put the initialization of the Prometheus plugin here.
+    plugin.init_prometheus()
 end


30 changes: 11 additions & 19 deletions apisix/plugin.lua
@@ -341,8 +341,6 @@ function _M.load(config)
         return local_plugins
     end

-    local exporter = require("apisix.plugins.prometheus.exporter")
-
     if ngx.config.subsystem == "http" then
         if not http_plugin_names then
             core.log.error("failed to read plugin list from local file")
@@ -356,15 +354,6 @@ function _M.load(config)
             if not ok then
                 core.log.error("failed to load plugins: ", err)
             end
-
-            local enabled = core.table.array_find(http_plugin_names, "prometheus") ~= nil
-            local active = exporter.get_prometheus() ~= nil
-            if not enabled then
-                exporter.destroy()
-            end
-            if enabled and not active then
-                exporter.http_init()
-            end
Comment on lines -360 to -367
Contributor

Add some description under this comment explaining why we removed it and moved to init and destroy hooks.

Member Author

@SkyeYoung SkyeYoung Jul 22, 2025

The original code bypassed the plugin.init() and old_plugin.destroy() calls used in https://github.com/apache/apisix/blob/6fb9bf94281525c1fca397f681b4890b69440369/apisix/plugin.lua and implemented its own reload logic for the prometheus plugin, for a reason I have not yet understood (perhaps because prometheus.lua originally did not contain the init and destroy functions).

What first prompted this change: even after separating the init_prometheus part and placing it at the end of init_worker, directly calling exporter_timer() still caused an error. Debugging revealed another initialization path here, which is clearly redundant.

We now provide init and destroy functions in prometheus.lua, so that initialization and reloading of the prometheus plugin are handled within the plugin's own files, reducing coupling.

This also lets the prometheus plugin go back to the mechanism provided by plugin.lua, which reduces special cases, lowers the cost of understanding, and makes the code easier to maintain.
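
The hook-based flow described above can be sketched in plain Lua as follows. This is an illustration under assumptions, not the plugin.lua implementation: `reload`, `events`, and the plugin table are hypothetical, and the real loader does considerably more work.

```lua
-- Records the order in which the hooks fire.
local events = {}

-- Hypothetical plugin table exposing the two hooks.
local prometheus_plugin = {
    name = "prometheus",
    init = function() events[#events + 1] = "init" end,
    destroy = function() events[#events + 1] = "destroy" end,
}

-- Hypothetical, simplified loader: destroy the old set, then init the new set.
local function reload(old_plugins, new_plugins)
    for _, plugin in ipairs(old_plugins) do
        if plugin.destroy then
            plugin.destroy()
        end
    end
    for _, plugin in ipairs(new_plugins) do
        if plugin.init then
            plugin.init()
        end
    end
end

reload({}, { prometheus_plugin })                     -- first load
reload({ prometheus_plugin }, { prometheus_plugin })  -- reload, still enabled
reload({ prometheus_plugin }, {})                     -- plugin removed from the list

print(table.concat(events, ","))
```

Because the loader treats every plugin the same way, no prometheus-specific branch is needed in it, which is the reduction in special cases the comment refers to.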

         end
     end

@@ -808,18 +797,21 @@ do
 end


-function _M.init_worker()
+function _M.init_prometheus()
     local _, http_plugin_names, stream_plugin_names = get_plugin_names()
+    local enabled_in_http = core.table.array_find(http_plugin_names, "prometheus")
+    local enabled_in_stream = core.table.array_find(stream_plugin_names, "prometheus")

-    -- some plugins need to be initialized in init* phases
-    if is_http and core.table.array_find(http_plugin_names, "prometheus") then
-        local prometheus_enabled_in_stream =
-            core.table.array_find(stream_plugin_names, "prometheus")
-        require("apisix.plugins.prometheus.exporter").http_init(prometheus_enabled_in_stream)
-    elseif not is_http and core.table.array_find(stream_plugin_names, "prometheus") then
-        require("apisix.plugins.prometheus.exporter").stream_init()
+    -- For stream-only mode, there are separate calls in ngx_tpl.lua.
+    -- And for other modes, whether in stream or http plugins,
+    -- the prometheus exporter needs to be initialized.
+    if is_http and (enabled_in_http or enabled_in_stream) then
Contributor

@bzp2010 bzp2010 Jul 22, 2025

NOTE

We will always handle metrics generation only in the http subsystem.

  1. This ensures that the work is not duplicated across http and stream, which would waste compute resources.
  2. This simplifies the design.
  3. Whether or not the user has http enabled (i.e., whether or not APISIX is in stream-only mode), an http block containing the server block for the Prometheus export API (:9091) is always present; otherwise Prometheus would be pointless. This means an http subsystem context is always available for the timer that periodically generates metrics, even in stream-only mode.

Contributor

@bzp2010 bzp2010 Jul 22, 2025

Please add some comments to the code to document the design intent. @SkyeYoung

Member Author

done.

Contributor

It might be better to add a link to this PR comment here.

https://github.com/apache/apisix/pull/12383/files#r2221993953

Member Author

@bzp2010 I think this part of the code can be traced through the modification history, just like the old code.

+        require("apisix.plugins.prometheus.exporter").init_exporter_timer()
     end
+end
+

+function _M.init_worker()
     -- someone's plugin needs to be initialized after prometheus
     -- see https://github.com/apache/apisix/issues/3286
     _M.load()
9 changes: 8 additions & 1 deletion apisix/plugins/prometheus.lua
@@ -17,7 +17,6 @@
 local core = require("apisix.core")
 local exporter = require("apisix.plugins.prometheus.exporter")

-
 local plugin_name = "prometheus"
 local schema = {
     type = "object",
@@ -35,6 +34,7 @@ local _M = {
     priority = 500,
     name = plugin_name,
     log = exporter.http_log,
+    destroy = exporter.destroy,
Contributor

NOTE

When plugins are reloaded through the Admin API, the plugin (and the prometheus instance inside it) is always destroyed and then loaded again based on the latest configuration.
If a reload is performed after the plugin has been removed from the plugin list in the configuration file, the plugin is not restored until a later reload re-enables it.

Technically, exporter.destroy just backs up the prometheus module instance by copying it to another variable.
This causes the export API to stop working: from then on it always returns {}, which is consistent with the current behavior.
Under the hood, the timer also stops working and no longer generates metrics at the configured interval, so the metrics computation overhead introduced by APISIX is completely eliminated.
When the next plugin reload occurs, if prometheus is re-enabled, the timer resumes running.

Unfortunately, the background timer introduced by the third-party prometheus library never stops running.
It is registered with ngx.timer.every to synchronize the shdict at regular intervals, and this overhead cannot be paused or resumed from outside unless we fork and modify the library itself.

So this destruction does not mean that the prometheus instance is actually destroyed, that the synchronization timer is stopped, or that the shdict is cleared. None of this happens.
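
The behavior described in this note can be modeled in plain Lua as a rough sketch. Every name below is hypothetical: a plain table stands in for the shared dict, `refresh_tick` is called by hand instead of by a timer, `apisix_up 1` is a made-up metric line, and reusing the backed-up instance in `init` is an assumption of the sketch rather than something the note states.

```lua
local cache = {}              -- stand-in for the shared dict cache
local active_instance = nil   -- non-nil only while the plugin is enabled
local backup_instance = nil

local function init()
    -- Assumption: reuse the instance that destroy() moved aside, if any.
    active_instance = backup_instance or {
        collect = function() return "apisix_up 1\n" end,
    }
end

local function destroy()
    -- Nothing is torn down: the instance is only moved to another variable.
    backup_instance = active_instance
    active_instance = nil
end

local function refresh_tick()
    if not active_instance then
        return false          -- the timer body is a no-op while destroyed
    end
    cache.metrics = active_instance.collect()
    return true
end

local function export()
    if not active_instance then
        return "{}"
    end
    return cache.metrics or "{}"
end

init()
refresh_tick()
local before = export()       -- cached metrics text
destroy()
local skipped = refresh_tick()
local during = export()       -- "{}" while the plugin is disabled
init()
local after = export()        -- cached text is served again
```

Note that the cache itself is untouched by `destroy`, mirroring the statement that the shdict is not cleared.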

     schema = schema,
     run_policy = "prefer_route",
 }
@@ -55,4 +55,11 @@ function _M.api()
 end


+function _M.init()
Contributor

NOTE

We now use the plugin system's built-in hooks, namely init, to initialize the prometheus instance and register the prometheus metrics.

Note, however, that data is populated only when the plugin first starts (usually at worker startup, i.e. the init_prometheus call in http_init_worker in init.lua) and on every timer tick.
This initialization only registers the metrics; it does not populate any data.

+    local local_conf = core.config.local_conf()
+    local enabled_in_stream = core.table.array_find(local_conf.stream_plugins, "prometheus")
+    exporter.http_init(enabled_in_stream)
Contributor

NOTE

The prometheus plugin, loaded by the http subsystem, registers http metrics there and decides whether to also register stream metrics (xrpc) depending on whether the stream subsystem has been started.
This is mainly for the needs of metrics generation in privileged processes; stream data is not actually reported in any phase of the http subsystem.

+end
+
+
 return _M