1. An Architecture Blueprint for a Central Logging System

1.1. Introduction

Logging is mostly treated as a local affair: that of an application or solution, of a team, or even of a single developer or a group thereof. But the increasing complexity of software systems also increases the effort needed to draw the right conclusions from a large heap of heterogeneous logging records.

There is a clear trend from large monolithic application building blocks towards smaller, but therefore intensely interconnected, software components. For this reason s IT Solutions AT (the IT subsidiary of Erste Bank and Sparkassen in Austria, www.s-itsolutions.at/) started a project to build a central logging and journalling data lake.
The architectural pattern described here follows the lines of this "Central Logging & Journalling" (CLJ) solution of sIT. It is not targeted at smaller systems but tries to deal with enterprises and larger IT landscapes.

So, if you feel your path to wisdom by reading logs looks like this, you’re a happy developer already.
forest

But if it looks a bit more like this, further reading could maybe help you.
jungle

1.2. Goals

Logging is not an end in itself — it enables many use-cases that can be grouped into four partitions:

[A]. Support: Find out what the system did at runtime, in order to detect the source of problems or to give information to other stakeholders. Most use-cases here investigate exceptional program behaviour.

[B]. Compliance: The number of regulatory use-cases increases; the run-time behaviour and intermediate data of software must be documented, often for many years.

[C]. Monitoring & Alerting: When a stream of logging data exists, it is natural to also use this stream to determine the system state, detect problems, and report them via multiple channels.

[D]. Analytics & Intelligence: Sophisticated tools allow data mining, BI etc. to find ways to improve the business, be it by exploring customer behaviour, by predicting operations problems, or by something we don’t even dream of yet.

Table 1. Use-case groups

Support [A]:
* Customer Care
* Issue research
→ Access security
→ Searchable, near-time

Compliance [B]:
* Regulatory queries
→ Long-term
→ Safe data store
→ Ideally certified
→ Infrequent queries

Monitoring & Alerting [C]:
* Stream analysis
* Alerting endpoints
→ Needs rules
→ High-performing

Analytics & Intelligence [D]:
* Statistics
* Big Data Analysis
* Machine Learning
* Predictive Analysis
→ Highly specialized toolset

1.3. Architecture

A possible architecture could make use of the following building blocks:

logical arch
Figure 1. Logical architecture of CLJ

The functional as well as the non-functional (quality) requirements of the aforementioned use-case groups differ greatly from each other. Therefore it makes sense to use different software products to fulfill those requirements.

1.3.1. Messaging Brick

This building block provides a reliable (true 24/7) component into which the applications can upload their logging records. It is high-performing, lean and stable, and therefore capable of swallowing even extreme load peaks.

This building block also serves use-case group [C]. The type of product is queue-like; our implementation uses Apache Kafka (kafka.apache.org/). Another example would be Amazon’s Kinesis/Firehose in an AWS-based environment.

1.3.2. Online Research Store

This record store is responsible for structured and fast searches over log records, and for finding connections between them.

Our implementation uses ElasticSearch (www.elastic.co/de/) together with a self-written ReST service and an Angular-based front-end.

1.3.3. Compliance Store

Selected log records (defined by the solution) are persisted in this record store. It is very reliable, needs a back-up, is fast at writing, and stores the record before it can be tampered with. On the other hand it does not need a super-sophisticated query facility.

In our implementation we decided on Apache Cassandra (cassandra.apache.org/). Other possibilities would be to store the selected records in flat files and archive them, or to use an RDBMS.

1.3.4. The Client Side

Applications can either send their log records directly into the messaging brick, or have them harvested from the filesystem or another data store. Both methods have pros and cons.

Table 2. Harvesting methods

Direct transfer:
+ Fastest
+ Possibly eliminates one component (the file system)
- Technically a tight coupling

Logfile harvesting:
+ Non-intrusive to existing applications
- Needs another process (resources + monitoring)

Direct transfer

Possibilities for applications to integrate are:

  • Own client libs of messaging brick

  • APIs for creating messages that fit to the data model of CLJ

  • Appenders for existing logging frameworks (e.g. log4j2 in Java, or log4net for C#)

Generally it is a good idea to offer integration libraries that handle situations where the messaging brick suffers a failure. In those cases the using application must not be brought down by logging; this mitigates the tight-coupling issue.
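Such a fail-safe integration library can be sketched as a thin wrapper around the transport. The following Python sketch is illustrative only (the class name `FailSafeLogClient` and the injected `send` callable are assumptions, not part of CLJ): it buffers records while the brick is down and retries them on the next call, so a logging failure never propagates into the application.

```python
import json
from collections import deque

class FailSafeLogClient:
    """Wraps the transport to the messaging brick so that a broker
    outage can never bring down the application doing the logging."""

    def __init__(self, send, buffer_size=1000):
        # 'send' is any callable that delivers one serialized record
        # to the messaging brick (e.g. a Kafka producer's send method).
        self._send = send
        self._buffer = deque(maxlen=buffer_size)  # oldest records drop first

    def log(self, record: dict) -> bool:
        """Try to deliver the record; buffer it on failure. Never raises."""
        payload = json.dumps(record)
        try:
            self._flush()                 # retry anything buffered earlier
            self._send(payload)
            return True
        except Exception:
            self._buffer.append(payload)  # keep for a later retry
            return False

    def _flush(self):
        # Deliver buffered records in original order.
        while self._buffer:
            self._send(self._buffer[0])
            self._buffer.popleft()
```

A production library would additionally bound memory use (here a bounded deque that silently drops the oldest records) and retry in a background thread instead of piggybacking on the next log call.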

Logfile harvesting

There are a lot of tools for that use-case, ranging from light-weight native apps that are integrated into the operating system up to full-scale ETL tools (en.wikipedia.org/wiki/Extract,_transform,_load).

A few examples are Logstash, Flume, and rsyslog.

In certain architectures, some of these products could serve as the messaging brick itself.

2. Central Logging Datamodel

2.1. Partitioning of the log record space

Each record has its own id value, making it unique in all of the data stores.

For managing the stores (especially the online research store for [A]), though, it is necessary to organize the records along a number of dimensions. This separation then supports the determination of

  • Access rights/permissions

  • Retention times

  • Backup strategy

With that, the integrated applications can gain a lot of control and flexibility for their data.

The suggested dimensions are a combination, fitting the actual need, of the following fields:

  • tenant (in case of a real multi-tenant system with separated accounts)

  • environment (if environments are not separated physically or logically on the server side)

  • solution, which determines the organizational owner of the log records within a tenant

  • recordType, to distinguish between different needs of building blocks and types of logging and journal data.
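Taken together, these dimensions can be combined into a single partitioning key, e.g. for naming per-partition indices in the online research store or for attaching retention and permission rules. A minimal Python sketch, under the assumption that records are plain dicts using the field names above (the function name and the separator are illustrative):

```python
def partition_key(record: dict) -> str:
    """Builds a partitioning key from the suggested dimensions.
    Only the dimensions actually present in the record are used,
    so a single-tenant system simply omits 'tenant'."""
    dimensions = ("tenant", "environment", "solution", "recordType")
    parts = [str(record[d]) for d in dimensions if d in record]
    if not parts:
        raise ValueError("record carries none of the partitioning dimensions")
    return "-".join(parts).lower()

# Example: a session record of the (hypothetical) solution 'ebanking'
key = partition_key({
    "tenant": "201", "environment": "prod",
    "solution": "ebanking", "recordType": "session",
})  # "201-prod-ebanking-session"
```

The resulting key can then serve as an index name prefix, a topic suffix, or a lookup key into a table of retention periods.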

2.2. Fields

This list of fields is a comprehensive list of common values a log system could care for. Different applications in different contexts might use one or another subset of this enumeration, hardly ever setting all of them. But, and that is the main reason for this list, values with similar semantics in a log record store should be named identically, to make traversing logs of different applications easier.
Mandatory fields are printed bold.

Table 3. NDM fields
Type Field Name Short Description Long Description

String

id

Technical id for the log record

This can be set by the client (if trusted to care for uniqueness) or be omitted and then set by the server. The server allows the id to be reused (= update) for semantics like records of timespans (e.g. sessions). The proposed algorithm is UUID.

Header Fields, meta data of each record

String

recordType

Type of the record.

This is an unbounded enumeration; the solution is free to choose a value. It is nevertheless recommended to use a known value (see subpage) to make the semantics of the record easier to recognize. Record types can be shared between solutions; e.g. session, activity and techInfo are record types used by several applications. The record type is used for partitioning the CLJ data stores for the permission system, as well as a key for defining retention periods and the archiving strategy.

String

recordSubType

Additional field to identify the event

Can be used as the type of the source log record. For example, if the recordType is 'serverLog', the recordSubType could be "tomcat" or "weblogic".

String

tenant

Institute number

If needed, for organizations serving multiple jurisdictional tenants, this is the tenant code.

String

environment

Environment identifier

If needed, when the development, test, staging, production etc. environments are not separated by dedicated data store instances but merged into one, this identifier determines from which environment a log record originates.

DateTime-WithFractionSeconds

recordTimestamp

When the log record has been created

If the client does not provide this, or the given value cannot be parsed on server side, the processing engine will create a timestamp as next best guess.

Long

sequence

Determines order

Often the record timestamp is not sufficient to discriminate and order a set of log records; e.g. ElasticSearch does not care for finer granularity than milliseconds. In this case the sequence field can store micro- or nanoseconds. Another possibility is that a client uses this field for a gapless sequence, to be sure that no records are lost during transmission, processing and retrieval. A logging front-end can use this field as the sole default order attribute, or as a secondary order attribute after recordTimestamp.

String

logLevel

Level of importance, as provided by many low-level logging systems

This field is optional, it is also not normalized, meaning that whatever the client solution provides here will be taken as-is. A lot of logging libraries have their own mind on this topic.

User Info, information about the person or technical systems connected to the log record

String

user

Unique user id in its userType domain

This identifies the user or system uniquely within the domain given in "userType". This value gains importance in the context of current data protection laws.

String

userType

Domain this user account belongs to

Needed if different user domains should be distinguished, like internet users (customers) and intranet users (employees), or when the user domains of subsidiaries are not clearly separated by the user ids.

Source Info, which component wrote this log record

String

solutionCode

Unique identifier of a solution

Identifies the Solution as unit in the IT landscape.

String

solutionFunctionCode

Id of functional building block

If needed, more fine-grained organizational partitioning.

String

sourceApplication

Building block

More technical/architectural partitioning key.

String

sourceHostname

System name of the server initiating the logging call

e.g. DNS of physical or virtual system

String

sourceIp

Client IP, originator of the log

The value may differ depending on the nature of the originator (e.g. a browser-based application, or a batch)

String

userAgent

Software that initiated the call

This field is used when the software of the user/client, and its version, is relevant; e.g. in web front-ends this identifies the browser that has been used. The writing solution can provide any information if it thinks that information about its caller makes a difference.

String

agentVersion

TODO

deprecated, might be removed in the future.

String

serverInstanceName

Identifies the server instance

e.g. the docker pod

Initiating solution

String

clientId

Code from initiating system

Initiating systems are mostly user front-ends or batch processes.

Harvesting Info, where the log record was first persisted; might be different from the source solution

String

sourceType

Syntax of the incoming data

Syntax of the incoming data (into the messaging brick). 'generic' means using this data model in JSON; this is the default value. If the syntax is not 'generic', the central logging service might be able to do a proper transformation.

String

loggingHostname

Server Host Name

like sourceHostname

String

loggingHostIp

Server IP address

The system that provided the logging information, e.g. an Apache host for access logs, or any other harvesting service running Logstash, Flume, rsyslog or a similar tool.

String

logFile

file name and path from which the log record has been harvested, if applicable

If log records are not sent directly to the messaging building block but harvested from a logfile (by Logstash or similar software), the filename and path in the appropriate format (Windows, Unix, Mainframe, …) can be sent here if needed.

Context

String

parentId

Hierarchical predecessor of this log record.

Could be of a functional or sequential order. Here a key of a hierarchically higher-level record can be set, so a tree-like structure of log records can be created.

String

contextId1

Mapping context id field 1

Example: The id of a user session.

String

contextId2

Mapping context id field 2

Example: The (use case) id of a user’s activity.

String

contextId3

Mapping context id field 3

Example: The id of an explicit technical log record.

String

contextId4

Mapping context id field 4

DateTime-WithFractionSeconds

startDate

Start date of the record

For journalling records that have a time span, this field of the event signals the begin timestamp.

DateTime-WithFractionSeconds

endDate

End date of the record

For journalling records that have a time span, this field of the event signals the end timestamp.

String

correlationId

Correlation ID for a synchronous or quasi-synchronous call

Unique id that is created as early as possible (ideally by the initiator) and then passed along the whole call hierarchy to create traces of calls.

Unstructured and semistructured data

String

message

Log Message

All the information that is not part of other fields

String

additionalInfo

semi-structured data

Business or other data. Technically this is a text field. It is recommended, though, to use JSON syntax, because the front-end can interpret it and display a tree structure. Special case of additionalInfo: external links. These can be rendered in the UI as links with the following syntax: additionalInfo.extlink.ref is the URI of the external link; additionalInfo.extlink.name is the display name of the link.

Result section

String

resultCode

Code if the record represents a task of any kind

e.g. an HTTP response code, an exception, or an error

String

errorMessage

Error Message

Any standardized code or message the sending solution wants to log.

Boolean

businessError

Business Error

Sometimes business errors are stored as normal messages. It is up to the application to decide which message represents a business error. This value should be true for business errors.

Status

normalizedStatus

Status field red/yellow/green

This field is for the user, giving a hint about whether this log record represents an OK status, a warning or an error. enum Status { red yellow green }

Technical information

String

thread

Name of the server thread

String

logger

Software origin

Name of the class and (optionally) the method that logs this message

Long

durationMs

Duration of a call in milliseconds

String

logProcessingError

StackTrace of the log processing error.

This is not provided by the client solution but used if anything goes wrong in CLJ log record processing.
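To tie the field list together, the following Python sketch shows how a client could assemble a minimal 'generic' (JSON) record: id, recordTimestamp, sequence and sourceType are filled with defaults, and everything else from the field list can be passed through. The helper name `make_log_record` and its defaults are illustrative assumptions, not a prescribed CLJ API.

```python
import json
import time
import uuid
from datetime import datetime, timezone

def make_log_record(solutionCode, recordType, message, **optional):
    """Assembles a minimal CLJ-style log record as a dict.
    Any other field from the data model (user, correlationId,
    normalizedStatus, ...) can be passed as a keyword argument."""
    record = {
        "id": str(uuid.uuid4()),  # proposed algorithm: UUID
        "recordType": recordType,
        # DateTime with fraction seconds, here at millisecond precision
        "recordTimestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "sequence": time.time_ns() % 1_000_000,  # sub-millisecond tiebreaker
        "solutionCode": solutionCode,
        "message": message,
        "sourceType": "generic",  # default: this data model in JSON
    }
    record.update(optional)
    return record

# Example: a technical record of the (hypothetical) solution 'ebanking'
record = make_log_record(
    "ebanking", "techInfo", "login succeeded",
    user="jdoe", userType="internet", normalizedStatus="green",
)
payload = json.dumps(record)  # what would go into the messaging brick
```

A real client library would also validate field names against the data model and fall back to a server-side timestamp when the client clock is unreliable, as described for recordTimestamp above.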

3. About CLJ

CLJ is a proposal to harmonize logging in an environment where multiple software building blocks are working together in order to fulfill shared requirements.

CLJ is a design blueprint, a proposal for how to align a shared logging environment.

  • Which building blocks to position in order to have smooth operations

  • Which fields to care for, having a common naming convention

  • Think about the use-cases that support the organization

  • Grounded in a running system of a not-so-small bank subsidiary

  • Feedback and contribution highly appreciated.

  • Source: CLJ’s asciidoc sources are hosted at CLJ sources.

  • Twitter: @mcaviti

Authored by the CLJ team at s IT Solutions AT (www.s-itsolutions.at), led by Klemens Dickbauer.