Skip to content
Snippets Groups Projects
  • Yorhel's avatar
    d523a77f
    Improve exclude pattern matching performance (and behavior, a bit) · d523a77f
    Yorhel authored
    Behavioral changes:
    - A single wildcard ('*') does not cross directory boundary anymore.
      Previously 'a*b' would also match 'a/b', but no other tool that I am
      aware of matches paths that way. This change breaks compatibility with
      old exclude patterns but improves consistency with other tools.
    - Patterns with a trailing '/' now prevent recursing into the directory.
      Previously any directory excluded with such a pattern would show up as
      a regular directory with all its contents excluded, but now the
      directory entry itself shows up as excluded.
    - If the path given to ncdu matches one of the exclude patterns, the old
      implementation would exclude every file/dir being read, this new
      implementation would instead ignore the rule. Not quite sure how to
      best handle this case, perhaps just exit with an error message?
    
    Performance wise, I haven't yet found a scenario where this
    implementation is slower than the old one and it's *significantly*
    faster in some cases - in particular when using a large amount of
    patterns, especially with literal paths and file names.
    
    That's not to say this implementation is anywhere near optimal:
    - A list of relevant patterns is constructed for each directory being
      scanned. It may be possible to merge pattern lists that share
      the same prefix, which could both reduce memory use and the number of
      patterns that need to be matched upon entering a directory.
    - A hash table with dynamic arrays as values is just garbage from a
      memory allocation point of view.
    - This still uses libc fnmatch(), but there's an opportunity to
      precompile patterns for faster matching.
    d523a77f
    History
    Improve exclude pattern matching performance (and behavior, a bit)
    Yorhel authored
    Behavioral changes:
    - A single wildcard ('*') does not cross directory boundary anymore.
      Previously 'a*b' would also match 'a/b', but no other tool that I am
      aware of matches paths that way. This change breaks compatibility with
      old exclude patterns but improves consistency with other tools.
    - Patterns with a trailing '/' now prevent recursing into the directory.
      Previously any directory excluded with such a pattern would show up as
      a regular directory with all its contents excluded, but now the
      directory entry itself shows up as excluded.
    - If the path given to ncdu matches one of the exclude patterns, the old
      implementation would exclude every file/dir being read, this new
      implementation would instead ignore the rule. Not quite sure how to
      best handle this case, perhaps just exit with an error message?
    
    Performance wise, I haven't yet found a scenario where this
    implementation is slower than the old one and it's *significantly*
    faster in some cases - in particular when using a large amount of
    patterns, especially with literal paths and file names.
    
    That's not to say this implementation is anywhere near optimal:
    - A list of relevant patterns is constructed for each directory being
      scanned. It may be possible to merge pattern lists that share
      the same prefix, which could both reduce memory use and the number of
      patterns that need to be matched upon entering a directory.
    - A hash table with dynamic arrays as values is just garbage from a
      memory allocation point of view.
    - This still uses libc fnmatch(), but there's an opportunity to
      precompile patterns for faster matching.
scan.zig 39.96 KiB