Skip to content

When monitoring /var/log/journal/remote events are missed when two remote hosts send messages at the same time (updates) #134

@louwrentius

Description

@louwrentius

Context

  • 1 x log server (VM on Apple M2)
  • 2 x log client (VM on Apple M2)
  • Using systemd-journal-remote on server (http)
  • Using systemd-journal-upload on clients (http)

Testing Method
We first cleanup the server with:

rm /var/log/journal/remote/*

On each client we run (side by side / simultaneously:

for x in {1..10000}; do echo "EVENT $x" | logger ; done

So we expect 20,000 records on the server, which we validate:

root@log-server:/home/ansible# journalctl -D /var/log/journal/remote/ --no-pager | grep -v Boot | wc -l
20000

I've tested this many times to confirm that the systemd-journal-remote/upload mechanism is OK.

The failing code

import select
from systemd import journal

 j = journal.Reader(path="/var/log/journal/remote")
 j.seek_tail()
 j.get_previous() # prevent wrap around

 p = select.poll()
 p.register(j, j.get_events())

 print("Starting....")
 event_counter = 0
 while p.poll():
      for x in j:
          event_counter += 1
      if j.process() != journal.APPEND:
          print("...WAITING....")
      else:
          print(f"Event counter: {event_counter}")

First of all, I'm not sure if this code is right. If I've made a glaring mistake, I apologise in advance for this report.

That said, I get correct results if I run my '10K log generator command' on one client at a time.

Event counter: 9915
Event counter: 9933
Event counter: 9945
Event counter: 9950
Event counter: 9987
Event counter: 10000

If I run the command at the same time on each client at once, I would expect 20K records, but it gets nowhere near that:

Event counter: 10228
Event counter: 10228
Event counter: 10228
Event counter: 10228
Event counter: 10228
Event counter: 10228
Event counter: 10228

In this example I'm missing almost half of all expected events. Yet I can confirm on the log server with the 'journalctl' command that there are actually 20K log records stored on the server.

Even if I test with just 1K events per client and run that simultaneously it's also clearly visible:

Event counter: 1096
Event counter: 1141
Event counter: 1215
Event counter: 1430

Happy to run any additional tests if that helps.
If I made a huge mistake, please let me know.

Update 1
I've been testing this issue on virtual machines on my mac.
I've also a few physical x86 micro pc's and the issue is the same.
Rate-limiting the logging with sleep .01 doesn't impact the result.

Question, is python-systemd built to handle remote logging from multiple clients through polling?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions