Commit 3c8c7385 authored by Pavel Kácha's avatar Pavel Kácha

More documentation

parent 0d9b683e
Use of this source is governed by an ISC license, see LICENSE file.

Deadbeat documentation
================================================================================

.. note::

README
doc/_doclib/modules
*Deadbeat* is a framework for creating event driven data transformation tools.

With Deadbeat you can create a well behaving \*nix application/daemon/cron script which ingests,
processes, transforms and outputs realtime data. It allows you to create a geartrain of cogs,
where each cog does one specific action. The cogs range from fetching information, parsing,
modification, typecasting, transformations, to enrichment, filtering, anonymisation,
deduplication, aggregation, formatting, marshalling and output.

Deadbeat is event driven: it is able to watch and act upon common sources of events: timer,
I/O poll, inotify and unix signals. Usually Deadbeat just slacks around, doing completely
nothing until some data arrives (be it the primary data feed or answers to network queries).

The code supports both Python 2.7+ and 3.0+.

The library is part of the SABU_ project, loosely related to Warden_, and often used to process
IDEA_ events.
.. _SABU: https://sabu.cesnet.cz
.. _Warden: https://warden.cesnet.cz
.. _IDEA: https://idea.cesnet.cz
Example
--------------------------------------------------------------------------------
A short example to show the basic concepts. The resulting script watches a log file, expecting
lines with IP addresses, translates them into hostnames and outputs the result into another text
file. Tailing the file is event driven (we spin the CPU only when a new line arrives), and so is
DNS resolution (other cogs keep reading/processing the data while waiting for DNS replies).
::

    # Basic imports
    import sys
    from deadbeat import movement, log, conf, dns, text, fs
    from deadbeat.movement import itemsetter
    from operator import itemgetter

    # Define configuration, which will work both in JSON config file and as command line switches
    conf_def = conf.cfg_root((
        conf.cfg_item(
            "input_files", conf.cast_path_list, "Input log files", default="test-input.txt"),
        conf.cfg_item(
            "output_file", str, "Output file", default="test-output.txt"),
        conf.cfg_section("log", log.log_base_config, "Logging"),
        conf.cfg_section("config", conf.cfg_base_config, "Configuration")
    ))

    def main():
        # Read and process config files and command line
        cfg = conf.gather(typedef=conf_def, name="IP_Resolve", description="Proof-of-concept bulk IP resolver")

        # Set up logging
        log.configure(**cfg.log)

        # Instantiate Train object, the holder for all the working cogs
        train = movement.Train()

        # Cog, which watches (tails) input files and gets new lines
        file_watcher = fs.FileWatcherSupply(train.esc, filenames=cfg.input_files, tail=True)

        # CSV parser cog, which for the sake of the example expects only one column
        csv_parse = text.CSVParse(fieldsetters=(itemsetter("ip"),))

        # Cog, which resolves IP addresses to hostnames
        resolve = dns.IPtoPTR(train.esc)

        # CSV writer cog, which creates line with two columns
        serialise = text.CSVMarshall(fieldgetters=(itemgetter("ip"), itemgetter("hostnames")))

        # Text file writer cog
        output = fs.LineFileDrain(train=train, path=cfg.output_file, timestamp=False, flush=False)

        # Now put the geartrain together into the Train object
        train.update(movement.train_line(file_watcher, csv_parse, resolve, serialise, output))

        # And run it
        train()

    if __name__ == '__main__':
        sys.exit(main())
Concepts
--------------------------------------------------------------------------------
Events
~~~~~~
Various system actions can trigger events: signals, inotify, and events related to file
descriptors. Events can also be scheduled, or triggered by cogs themselves.
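A minimal sketch of this event-driven model, using only the standard library (not the Deadbeat API): an I/O poll with a timeout doubles as a timer, so the process sleeps until data arrives or the deadline fires. The socket pair stands in for a real data feed.

```python
# Minimal event-loop sketch using only the standard library; names and
# structure here are illustrative, not the Deadbeat Escapement API.
import selectors
import socket

sel = selectors.DefaultSelector()
r, w = socket.socketpair()          # stand-in for a real data feed
sel.register(r, selectors.EVENT_READ)

w.send(b"new line arrived\n")       # simulate the feed producing data

events = sel.select(timeout=1.0)    # sleeps until I/O or the 1 s "timer" fires
received = []
for key, _mask in events:
    received.append(key.fileobj.recv(4096))

sel.close()
r.close()
w.close()
```

With nothing to read, ``select`` would simply block for the timeout, which is how an application can "slack around doing nothing" without spinning the CPU.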
Cogs
~~~~
Cogs are simple (ok, sometimes not so simple) objects which act as callables. Once instantiated,
they wait to be run with some unspecified data as an argument, and return this data somehow
altered. The alteration can take the form of changing/adding/removing contents, splitting/merging
the data, or generating or deleting it altogether.

Cogs can also withdraw the data and return it to the pipeline in some way later.
Cogs can (by means of Train and Escapement) plan themselves to be run later, or connect their methods with
various events.
Special types of cogs are *supplies*, which generate new data, usually from sources outside of the application,
and *drains*, which output the data outwards from the scope of the application. But they are special only
conceptually: from the view of the Train and the application, any cog can generate new data or swallow it.
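The cog idea can be sketched with plain callables (illustrative only; the real Deadbeat cogs are classes with richer behaviour): each one takes data and returns it altered, and a drain pushes the data out of the pipeline.

```python
# Illustrative sketch of cogs as callables, not the real Deadbeat classes.
def parse_cog(line):
    # turn a raw input line into structured data
    return {"ip": line.strip()}

def enrich_cog(data):
    # toy "enrichment": flag whether the value looks like a dotted quad
    data["valid"] = data["ip"].count(".") == 3
    return data

def drain_cog(data, sink):
    # a drain outputs data outwards from the pipeline's scope
    sink.append(data)
    return data

sink = []
for raw in ["192.0.2.1\n", "bad\n"]:     # a supply would generate these lines
    drain_cog(enrich_cog(parse_cog(raw)), sink)
```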
Escapement
~~~~~~~~~~
This object (singleton within each application) takes care of various types of events in the system. It provides
API for cogs to register system and time based events.
Train
~~~~~
Another singleton, which contains all the cogs and maps the directed acyclic graph of their data
flow. It also takes care of broadcast event delivery and a correct (block free) shutdown sequence
of the cogs.
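The data-flow mapping can be pictured as a toy dispatcher over an acyclic graph (a hypothetical sketch, not the Deadbeat Train API): each cog's output is fed to all of its successors.

```python
# Toy sketch of a train: an acyclic mapping from each cog to its
# downstream cogs; the output of one is fed to all of its successors.
def double(x):
    return x * 2

def inc(x):
    return x + 1

results = []
def collect(x):
    results.append(x)

# double feeds both inc and collect; inc feeds collect
graph = {double: [inc, collect], inc: [collect]}

def run(cog, data):
    out = cog(data)
    for nxt in graph.get(cog, []):
        run(nxt, out)

run(double, 3)   # 3*2 = 6 flows on to inc (giving 7) and to collect (giving 6)
```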
Getters/setters
~~~~~~~~~~~~~~~
Cogs can operate upon various types of data. A cog's initial configuration usually contains one or
more getters or setters.
A *getter* is meant to be a small piece of code (usually a lambda function or
:py:func:`operator.itemgetter`) whose function is to return a piece of the data for the cog to
operate on (for example a specific key from a dictionary).

A *setter* is a complementary small piece of code, whose function is to set/incorporate a piece of
data, generated or computed by the cog, into the main data (for example setting a specific key of a
dictionary). To support both mutable and immutable objects, the setter must return the main data
object as its return value – so a new immutable object can be created and replace the old one.
The ID getter and setter are special. If provided, the ID setter is used by supply cogs to assign
a new ID to arriving data, while the ID getter can be used for logging, and for referencing data
for example in broadcasts.
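The setter-must-return-the-object rule can be illustrated with a sketch (the helper names here are hypothetical, not Deadbeat's `itemsetter`): for a dictionary the setter mutates in place but still returns the object, while for a tuple it must build and return a new one.

```python
from operator import itemgetter

# Hypothetical setters illustrating why a setter returns the main data
# object: immutable containers are replaced rather than mutated in place.
def key_setter(key):
    def set_(data, value):
        data[key] = value        # mutable dict: mutate in place...
        return data              # ...but still return the object
    return set_

def tuple_append_setter():
    def set_(data, value):
        return data + (value,)   # immutable tuple: build and return a new one
    return set_

get_ip = itemgetter("ip")
set_host = key_setter("hostname")

record = {"ip": "192.0.2.1"}
record = set_host(record, "example.org")   # caller always keeps the return value

row = ("192.0.2.1",)
row = tuple_append_setter()(row, "example.org")
```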
Data
~~~~
Anything which flows within the geartrain/pipeline from one cog to another.
Application
~~~~~~~~~~~
Deadbeat is meant for creating self-supporting applications, which sit somewhere, continuously
watching some sources of data, ingesting, processing and transforming the data, acting upon it and
forwarding it on.

An application is mainly a processing pipeline of *cogs*, managed by the *Train* (which contains
the *Escapement*). Supporting submodules also cover configuration, logging and daemonization.
The config submodule takes an application wide hierarchical configuration definition, and then
uses it for reading and merging a set of user provided configuration files together with command
line arguments. Various cogs can also be accompanied by a piece of configuration definition
insert, which can be used to simplify and reuse the global definition.
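The merging idea can be sketched with the standard library alone (illustrative only; the real conf submodule derives both the file schema and the command line switches from one hierarchical definition): defaults are overridden by the config file, which is in turn overridden by the command line.

```python
import argparse
import json

# Sketch of config-file + command-line merging; names are illustrative.
defaults = {"input_files": ["test-input.txt"], "output_file": "test-output.txt"}

def gather(config_text, argv):
    cfg = dict(defaults)
    cfg.update(json.loads(config_text))          # config file overrides defaults
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-file", dest="output_file")
    args = parser.parse_args(argv)
    # command line wins over both; unset switches (None) are ignored
    cfg.update({k: v for k, v in vars(args).items() if v is not None})
    return cfg

cfg = gather('{"output_file": "from-file.txt"}', ["--output-file", "from-cli.txt"])
```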
The log submodule provides simple logging initialization, configurable consistently with other
parts of Deadbeat.

The daemon submodule provides well behaving \*nix daemonization, together with pidfile management.
Indices and tables
--------------------------------------------------------------------------------

* :ref:`genindex`
* :ref:`modindex`