-
Notifications
You must be signed in to change notification settings - Fork 54
Performance improvements #65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
The `FORMATTER_RECORD_FIELD_SKIP_LIST` is used heavily to remove fields from the resulting document in the fashion `x in FORMATTER_RECORD_FIELD_SKIP_LIST`. This operation is `O(n)` for lists, but `O(1)` to `O(n)` for sets.
The `_get_record_fields` method extracts and formats the items of the `LogRecord` object into the message dict. After collecting the static extra fields and adding the per-record extra fields, `_remove_excluded_fields` is called to remove any keys we don't want. `_remove_excluded_fields` is called for the two `dict`s `message` and `extra_fields` and removes items if the key is present in `FORMATTER_RECORD_FIELD_SKIP_LIST`. We can improve this by not adding those fields in the first place since we're iterating over `record.__dict__` anyways.
@@ -32,11 +32,11 @@ class Constants: | |||
# Usually this list does not need to be modified. Add/Remove elements to | |||
# exclude/include them in the Logstash event, for the full list see: | |||
# http://docs.python.org/library/logging.html#logrecord-attributes | |||
FORMATTER_RECORD_FIELD_SKIP_LIST = [ | |||
FORMATTER_RECORD_FIELD_SKIP_LIST = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I wasn't aware there is such a difference on lookup performance between lists and sets.
It would be even better if we keep the constant a list here and convert it to a set later in the Formatter class, probably best in its __init__()
. Then people can still append/remove items as they wish. Even if named constants, they are meant to be customised if necessary.
A set is also mutable but with a different API and so users had to adjust their code.
I think converting the list to a set later when creating a Formatter instance, should be ok as this happens only once (or only a few times if the formatter is configured multiple times or on forking or so but still not as often as log records are to be formatted).
@@ -133,7 +133,9 @@ def _format_timestamp(self, time_): | |||
|
|||
# ---------------------------------------------------------------------- | |||
def _get_record_fields(self, record): | |||
return {k: self._value_repr(v) for k, v in record.__dict__.items()} | |||
return {k: self._value_repr(v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Since we removed now all unwanted fields from the record, we could skip trying to remove them again in _remove_excluded_fields
by only passing extra_fields
to the method.
Or maybe we incoporate the field removing code into _get_extra_fields
as it remains the only place where the filtering is necessary yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I thought of that. However, subclasses may depend on _remove_excluded_fields
to be called.
Another optimization would be to fold the _get_record_fields
, _get_extra_fields
and _remove_excluded_fields
into one method: It iterates over record.__dict__
and builds a dict
that can be message.update
d in the format
method. This would eliminate some iterations, checks and eventually del
calls as well.
However, this would be a larger break in API and behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point about API breakage.
Maybe it's worth for these changes to bump to 3.0 and announce the API changes, so people will notice.
In my hunt for a bottleneck in one of our applications, I've found another few hotspots. Not that it was slow before, but when you push a few thousand messages per minute, some things just pop up.
This PR contains two small but measurable improvements. My approach was a simple
timeit
script, run standalone and with a profiler:With the current 2.3.0, this script completed (on my machine) in around 2.6 seconds.
The first change was to change
FORMATTER_RECORD_FIELD_SKIP_LIST
to a set: this list is used in the fashion ofx in y
which is way faster for sets than list. This brought the execution time down to 2.2 seconds.The other change was to exclude fields from
FORMATTER_RECORD_FIELD_SKIP_LIST
in_get_record_fields
: The formatter pulled all fields from themessage
and theextra
into the dict and removed the fields afterwards in_remove_excluded_fields
. Since we're iterating over__dict__.items()
in_get_record_fields
anyway, we can skip the fields right there and save us thedel
later on.With both changes, I'm down to 1.7 seconds.
I'm working on another improvement involving the datetime handling, but that will come in another PR as it involves a little bit more code.