diff --git a/README.rst b/README.rst
index 8d1adc3966b48654f7addada3bd2a723cb84fda4..42ca1a1baebe100b645556d8ad4e8e89135d8f13 100644
--- a/README.rst
+++ b/README.rst
@@ -16,7 +16,7 @@ Deadbeat - README
 
 With Deadbeat you can create well behaving *nix application/daemon/cron script which ingests,
 processes, transforms and outputs realtime data. It allows you to create the geartrain of cogs,
-where each cog does one specific action. The cigs range from fetching information, parsing,
+where each cog does one specific action. The cogs range from fetching information, parsing,
 modification, typecasting, transformations, to enrichment, filtering, anonymisation,
 deduplication, aggregation, formatting, marshalling and output.
diff --git a/manual.rst b/manual.rst
index dbc561a55e5ac4e85fa81657645fff1a2b0b6100..41e3f1b9560da4524707e6d01853da64b7ce85ba 100644
--- a/manual.rst
+++ b/manual.rst
@@ -7,7 +7,7 @@
 
 Use of this source is governed by an ISC license, see LICENSE file.
 
-Welcome to Deadbeat documentation!
+Deadbeat documentation
 ================================================================================
 
 .. note::
@@ -29,8 +29,170 @@ Welcome to Deadbeat documentation!
 
    README
    doc/_doclib/modules
 
+*Deadbeat* is a framework for creating event driven data transformation tools.
+
+With Deadbeat you can create a well-behaved \*nix application/daemon/cron script which ingests,
+processes, transforms and outputs realtime data. It allows you to create a geartrain of cogs,
+where each cog does one specific action. The cogs range from fetching information, parsing,
+modification, typecasting, transformations, to enrichment, filtering, anonymisation,
+deduplication, aggregation, formatting, marshalling and output.
+
+Deadbeat is event driven; it is able to watch and act upon common sources of events: timer,
+I/O poll, inotify and unix signals.
+Usually Deadbeat just slacks around doing completely
+nothing until some data arrives (be it the primary data feed or answers to network queries).
+
+The code supports both Python 2.7+ and 3.0+.
+
+The library is part of the SABU_ project, loosely related to Warden_, and often used to process
+IDEA_ events.
+
+.. _SABU: https://sabu.cesnet.cz
+.. _Warden: https://warden.cesnet.cz
+.. _IDEA: https://idea.cesnet.cz
+
+Example
+--------------------------------------------------------------------------------
+
+A short example to show the basic concepts. The resulting script watches a log file, expecting lines with IP
+addresses, translates them into hostnames and outputs the result into another text file. Tailing the
+file is event driven (we spin the CPU only when a new line arrives), and so is DNS resolution (other cogs
+keep reading/processing the data while waiting for DNS replies).
+
+::
+
+    # Basic imports
+    import sys
+    from deadbeat import movement, log, conf, dns, text, fs
+    from deadbeat.movement import itemsetter
+    from operator import itemgetter
+
+    # Define configuration, which will work both in a JSON config file and as command line switches
+    conf_def = conf.cfg_root((
+        conf.cfg_item(
+            "input_files", conf.cast_path_list, "Input log files", default="test-input.txt"),
+        conf.cfg_item(
+            "output_file", str, "Output file", default="test-output.txt"),
+        conf.cfg_section("log", log.log_base_config, "Logging"),
+        conf.cfg_section("config", conf.cfg_base_config, "Configuration")
+    ))
+
+    def main():
+
+        # Read and process config files and the command line
+        cfg = conf.gather(typedef=conf_def, name="IP_Resolve", description="Proof-of-concept bulk IP resolver")
+
+        # Set up logging
+        log.configure(**cfg.log)
+
+        # Instantiate the Train object, the holder for all the working cogs
+        train = movement.Train()
+
+        # Cog, which watches (tails) input files and gets new lines
+        file_watcher = fs.FileWatcherSupply(train.esc, filenames=cfg.input_files, tail=True)
+
+        # CSV parser cog, which for the sake of the
+        # example expects only one column
+        csv_parse = text.CSVParse(fieldsetters=(itemsetter("ip"),))
+
+        # Cog, which resolves IP addresses to hostnames
+        resolve = dns.IPtoPTR(train.esc)
+
+        # CSV writer cog, which creates a line with two columns
+        serialise = text.CSVMarshall(fieldgetters=(itemgetter("ip"), itemgetter("hostnames")))
+
+        # Text file writer cog
+        output = fs.LineFileDrain(train=train, path=cfg.output_file, timestamp=False, flush=False)
+
+        # Now put the geartrain together into the Train object
+        train.update(movement.train_line(file_watcher, csv_parse, resolve, serialise, output))
+
+        # And run it
+        train()
+
+
+    if __name__ == '__main__':
+        sys.exit(main())
+
+Concepts
+--------------------------------------------------------------------------------
+
+Events
+~~~~~~
+
+Various system actions can trigger events, including signals, inotify and events related to file descriptors.
+Events can also be scheduled, or triggered by the cogs themselves.
+
+Cogs
+~~~~
+
+Cogs are simple (ok, sometimes not so simple) objects which act as callables and, once instantiated, wait
+to be run with some unspecified data as an argument, returning this data somehow altered. The alteration
+can take the form of changing/adding/removing contents, splitting/merging the data, or generating or deleting
+it altogether.
+
+Cogs can also withdraw the data and return it into the pipeline some time later.
+
+Cogs can (by means of the Train and the Escapement) plan themselves to be run later, or connect their methods
+with various events.
+
+Special types of cogs are *supplies*, which generate new data, usually from sources outside of the application,
+and *drains*, which output the data outwards from the scope of the application. But they are special only
+conceptually; from the view of the Train and the application, any cog can generate new data or swallow it.
+
+Escapement
+~~~~~~~~~~
+
+This object (a singleton within each application) takes care of various types of events in the system.
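+
+Conceptually, an escapement is a readiness-polling loop: the process sleeps in the kernel until
+something happens, then dispatches to whoever registered interest in that event. A minimal sketch
+of the idea, using only the standard :py:mod:`selectors` module of Python 3 (an illustration of
+the concept, not Deadbeat's actual API)::
+
+    import os
+    import selectors
+
+    sel = selectors.DefaultSelector()
+    r_fd, w_fd = os.pipe()
+
+    # Register interest in "readable" events on the pipe, with a callback
+    def on_readable(fd):
+        return os.read(fd, 64)
+
+    sel.register(r_fd, selectors.EVENT_READ, data=on_readable)
+
+    # The application sleeps in select() until an event arrives
+    os.write(w_fd, b"tick")
+    events = sel.select(timeout=1)
+    results = [key.data(key.fd) for key, mask in events]   # [b"tick"]
+
+Deadbeat's Escapement wraps this kind of loop (plus timers, signals and inotify) for the whole
+application.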
+It provides an API for cogs to register system and time based events.
+
+Train
+~~~~~
+
+Another singleton, which contains all the cogs and maps the directed acyclic graph of their data flow. It also
+takes care of broadcast event delivery and a correct (block-free) shutdown sequence of the cogs.
+
+Getters/setters
+~~~~~~~~~~~~~~~
+
+Cogs can operate upon various types of data. A cog's initial configuration usually contains one or more getters
+or setters.
+
+A *getter* is meant to be a small piece of code (usually a lambda function or :py:func:`operator.itemgetter`) whose
+function is to return a piece of the data for the cog to operate on (for example a specific key from a dictionary).
+
+A *setter* is a complementary small piece of code whose function is to set/incorporate a piece of data, generated
+or computed by the cog, into the main data (for example setting a specific key of a dictionary). To support both
+mutable and immutable objects, a setter must return the main data object as its return value – so a new immutable
+object can be created and replace the old one.
+
+The ID getter and setter are special. If provided, the ID setter is used by supply cogs to assign a new ID to
+arriving data, while the ID getter can be used for logging and for referencing data, for example in broadcasts.
+
+Data
+~~~~
+
+Anything that flows within the geartrain/pipeline from one cog to another.
+
+Application
+~~~~~~~~~~~
+
+Deadbeat is meant for creating self-supporting applications which sit somewhere, continuously watching some
+sources of data, ingesting, processing and transforming it, acting upon it and forwarding it on.
+
+An application is mainly a processing pipeline of *cogs*, managed by the *Train* (which contains the *Escapement*).
+
+Supporting submodules take care of configuration, logging and daemonisation.
+
+The Config submodule takes an application wide hierarchical configuration definition and uses it for reading
+and merging a set of user provided configuration files together with command line arguments.
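+
+The merging can be pictured as layers of dictionaries, where later layers (configuration files,
+then command line arguments) override earlier ones (defaults). A toy sketch of such layered
+merging in plain Python (illustrative only, not the actual conf module API)::
+
+    def merge(*layers):
+        """Later layers win; nested dictionaries are merged recursively."""
+        out = {}
+        for layer in layers:
+            for key, value in layer.items():
+                if isinstance(value, dict) and isinstance(out.get(key), dict):
+                    out[key] = merge(out[key], value)
+                else:
+                    out[key] = value
+        return out
+
+    defaults = {"output_file": "test-output.txt", "log": {"level": "info"}}
+    from_file = {"log": {"level": "debug"}}
+    from_cli = {"output_file": "custom.txt"}
+
+    cfg = merge(defaults, from_file, from_cli)
+    # cfg == {"output_file": "custom.txt", "log": {"level": "debug"}}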
+Various cogs can also be accompanied by a piece of configuration definition, which can be used to simplify
+and reuse parts of the global definition.
+
+The Log submodule provides simple logging initialisation, configurable consistently with the other parts of
+Deadbeat.
+
+The Daemon submodule provides well-behaved \*nix daemonisation, together with pidfile management.
+
 Indices and tables
-================================================================================
+--------------------------------------------------------------------------------
 
 * :ref:`genindex`
 * :ref:`modindex`