Machine log collection, management, and analytics are essential for a company to thrive in the cloud business of the information age. I recently learned how a top-performing cloud company architects its log management infrastructure. The underlying corporate log service evolves continuously with business needs, and it generates critical business value. This write-up first describes the log system implementation on AWS. It then discusses the architecture decisions made to address resource limitations, control the operating budget, and reduce operational disruption.
Centralized log management is the preferred solution for handling enterprise logging. Application, IT system, and cloud microservice logs are collected and managed in one central location, for example, Amazon Elasticsearch Service (Amazon ES).
DevOps and automation teams play a central role in developing and maintaining the log management infrastructure. The log collection and analytics data preparation phases are all in place. See [1]. As nice as the system looks from high ground, there is engineering overhead in addressing the infrastructure’s compute, storage, and budget limitations.
The ingestion pipeline first filters the log data, extracting the useful portion to reduce logging noise: which logs or log portions to retain and which to remove. The trimming directives usually come from the log data end-user, such as a data scientist or system analyst. They communicate with the DevOps/automation team to create customized log extraction filters for deployment. The goal is to control the volume and content of the logs. The DevOps/automation team builds these ad-hoc, best-effort log filters, and the process is laborious and error-prone.
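To make the filtering step concrete, here is a minimal sketch of what such an ad-hoc extraction filter might look like. The line format (`timestamp LEVEL message`), the rule names, and the thresholds are all assumptions for illustration; they are not taken from the company's actual pipeline.

```python
import re
from typing import Iterable, Iterator

# Hypothetical rules agreed with the log end-user (illustrative only).
KEEP_LEVELS = {"WARN", "ERROR"}            # severities worth retaining
DROP_PATTERNS = [re.compile(r"healthcheck"), re.compile(r"heartbeat")]
MAX_MESSAGE_CHARS = 200                    # trim long payloads to cap volume

# Assumed line layout: "<timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def filter_logs(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the log lines (trimmed) that pass the retention rules."""
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            continue  # unparseable lines are dropped in this sketch
        if m["level"] not in KEEP_LEVELS:
            continue  # drop low-severity noise
        if any(p.search(m["msg"]) for p in DROP_PATTERNS):
            continue  # drop known-noisy messages
        yield f"{m['ts']} {m['level']} {m['msg'][:MAX_MESSAGE_CHARS]}"
```

Every rule here encodes a decision made with the end-user, which is why a change in the application's log format or in the analysis requirements can silently break the filter.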
This log reduction filter exists because of the limitations of the log data infrastructure. In this example, it uses Amazon Elasticsearch Service (Amazon ES). Both the compute and storage resources used for indexing need to be monitored to keep the logging system healthy. To maintain a stable, well-performing operating state, log ingestion is capped at 30GB/day, see [8], and the backlog is retained for two weeks. After that, the data are backed up to economical Amazon S3 storage.
Ingested log data are critical for company operations. Different business functional units within the company deal with different log data and use them differently. For example, the performance and capacity team extracts metrics from logs to model system usage trends and forecast future demand. The customer business unit extracts metrics to analyze customer insight and create business value, for example, around customer churn. The DevOps team maintains the system’s operating state to meet SLA (Service Level Agreement) requirements. The log data metrics extraction and the subsequent analytics are highly customized, flexible, and fluid. Each functional unit is created to solve specific business problems. It is highly desirable to apply AI/ML techniques and methodology, see [7]. For example, the log ingestion data pipeline as a whole can be enhanced with tags or labels to facilitate later AI/ML analysis, with all the mutable fields in the log automatically extracted for analysis. This subject will be revisited in another blog post.
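As a rough illustration of tagging and mutable-field extraction, the sketch below treats numeric tokens (integers, decimals, dotted addresses) as the "mutable" parts of a message and replaces them with a placeholder to form a template. The function name `tag_record`, the `<*>` placeholder, and the choice of what counts as mutable are assumptions made for this example, not a description of any particular product.

```python
import re

# Assumption: numeric tokens such as 42, 3.14, or 10.0.0.7 are the mutable fields.
VARIABLE_RE = re.compile(r"\b\d+(?:\.\d+)*\b")


def tag_record(message: str, source: str) -> dict:
    """Attach a source tag and split a message into template + mutable fields."""
    fields = VARIABLE_RE.findall(message)        # extracted variable values
    template = VARIABLE_RE.sub("<*>", message)   # constant part of the message
    return {"source": source, "template": template, "fields": fields}


record = tag_record("request 42 from 10.0.0.7 took 130 ms", source="api")
print(record["template"])  # request <*> from <*> took <*> ms
print(record["fields"])    # ['42', '10.0.0.7', '130']
```

Grouping records by `template` and treating `fields` as features is one simple way such labels could feed later analysis.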
As mentioned earlier, the resources of the deployed Amazon ES cluster need to be monitored and controlled to keep the log system performing well. Amazon ES infrastructure can scale out, but the effort is no walk in the park, and it often requires some degree of trial and error. New log ingestion is capped at 30GB/day with a two-week retention time to maintain the operating budget and system performance. The system always holds about 500GB of new log records for processing. Older logs that fall outside the retention window are backed up to Amazon S3 at $25/TB-month for an indefinite period. AWS hosting for such a setup costs around $60k/year, not including engineering and operator costs.
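The figures above can be sanity-checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: it assumes 30-day months and 1TB = 1000GB, and it ignores index overhead and replicas, which the source does not quantify.

```python
INGEST_GB_PER_DAY = 30       # ingestion cap from the text
RETENTION_DAYS = 14          # two-week retention from the text
S3_USD_PER_TB_MONTH = 25     # S3 price quoted in the text


def hot_working_set_gb() -> int:
    """Raw log volume held in the search cluster at steady state."""
    return INGEST_GB_PER_DAY * RETENTION_DAYS


def s3_monthly_cost_usd(months_elapsed: int) -> float:
    """Monthly S3 bill after archiving at the cap for the given number of months.

    Assumes 30-day months and 1 TB = 1000 GB; nothing is ever deleted.
    """
    archived_gb = INGEST_GB_PER_DAY * 30 * months_elapsed
    return archived_gb * S3_USD_PER_TB_MONTH / 1000


print(hot_working_set_gb())        # 420
print(s3_monthly_cost_usd(12))     # 270.0
```

The 420GB of raw data is in the same range as the roughly 500GB working set quoted above. Under these assumptions, a full year of archived logs costs about $270 per month in S3, which is small next to the quoted $60k/year hosting figure.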
Here is a similar log infrastructure setup using Apica building blocks. The system becomes simpler because the use of S3 storage alleviates the resource limitations. See the figure below.
The Apica log management infrastructure removes the engineering overhead built into the previous setup. The new construct is efficient and straightforward. The table below lists the infrastructure resource engineering overheads; the tags refer to the earlier drawing.
Tag | Description | Overhead Action | Apica |
[4] | Need to reduce log ingestion to save log processing resources. Consult and communicate with the log end-user. | The process is error-prone because logs, apps, and requirements change. | Simple log data ingestion pipeline |
[5] | DevOps implements log filters to reduce the ingested log count | Implement the filters and validate the stored logs with the end-user | No need to trim log data. Store un-redacted logs in S3 storage. |
[8] | Maintain constant overhead and scale out ELK if needed | Elasticsearch service does not scale seamlessly. | Apica scales easily using K8S pods |
[9] | Back up the oldest logs to S3 daily to maintain a stable log working set size | Adds a backup step, with no easy process for re-using the backed-up log data | Operates directly on S3 storage: searching, event capturing, AI/ML analysis, etc. |
In summary, this article describes an operating log infrastructure setup on AWS. It also presents a similar Apica log infrastructure setup. The Apica log management infrastructure removes the induced engineering overhead thanks to its competitive advantage: the native use of S3 storage.
There are plenty of references on the web about scaling Elasticsearch, and the consensus among them is that such a task is not for the faint of heart.