Originally published at The New Stack
Most log management solutions store log data in a database and enable search by storing an index of the data. As the database grows in size, so does the index management cost. On a small scale, this isn't problematic. But in large-scale deployments, organizations end up spending lots of compute, storage and human resources just to manage their indexes, in addition to the data itself. When companies are handling terabytes of data every day, the database-backed log management system becomes untenable.
Another common issue is that most log solutions don't store just one copy of the data. Many DIY log management implementations use popular databases such as MongoDB, ElasticSearch and Cassandra. Let's take ElasticSearch as an example. An ElasticSearch cluster runs several replicas of the data in the hot store tier to ensure high availability. Even with data compression, the replication required to keep the data available still dramatically increases the total amount of storage necessary. The problem is magnified when you account for the storage needed for indexes.
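As a rough illustration, the replica count that drives this multiplication is just an index setting. The sketch below assumes an ElasticSearch cluster reachable at localhost:9200 and uses a hypothetical index name:

```python
import requests

# Hypothetical example: create a log index with one replica per primary shard.
# Every replica is a full copy of its primary, so hot-tier storage for this
# index roughly doubles before any index overhead is even counted.
resp = requests.put(
    "http://localhost:9200/logs-000001",
    json={
        "settings": {
            "number_of_shards": 3,    # primary shards
            "number_of_replicas": 1,  # one full copy of every primary
        }
    },
)
print(resp.json())
```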
Clustering also increases management complexity and requires users to understand how to handle node failures and data recovery. Even with replication, a replacement node cannot be spun up instantly when one goes down. In most cases, there is some window during which the log analytics system is unavailable. While this happens, data continues to come in because logs are generated in real time. Catching up requires provisioning additional resources, and because the incoming data never stops, it can be hard to get the log analytics system to catch up at all. One-click elasticity is critical to managing this at scale.
The challenges outlined above are a classic example of the hidden "storage operations tax" that any DIY solution has to pay. The larger the scale, the higher the tax! A company ingesting around one terabyte of data per day and keeping 30 days' worth of log data searchable would need roughly 30TB for the raw data alone, several times that once replication and index overhead are counted, and a proportional amount of RAM to keep it all queryable.
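A back-of-the-envelope sketch makes the tax concrete; the replica count and index overhead factor below are illustrative assumptions, not measurements:

```python
# Rough storage estimate for a database-backed log store.
ingest_tb_per_day = 1.0   # raw log volume
retention_days = 30       # searchable window
replicas = 1              # one replica per primary shard (assumed)
index_overhead = 0.2      # ~20% extra for indexes (assumed)

raw_tb = ingest_tb_per_day * retention_days
hot_tb = raw_tb * (1 + replicas) * (1 + index_overhead)

print(f"Raw data retained:  {raw_tb:.0f} TB")   # 30 TB
print(f"Hot storage needed: {hot_tb:.0f} TB")   # 72 TB
```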
The way to solve this problem is by moving away from databases and using a scalable API storage layer. An API storage layer like Amazon Web Services' S3, which has traditionally been used for cold storage, fits this requirement quite well. It provides high availability and durability, effectively unlimited scale and a far lower price per gigabyte than database-backed hot storage, and it takes your storage operations tax effectively to zero. However, to make this work, one has to ensure that applications do not suffer the higher latency typically associated with cold storage.
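A minimal sketch of what "storage behind an API" looks like in practice, using boto3 against a hypothetical bucket and key name:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-log-archive"  # hypothetical bucket name

# Writing and reading are single API calls; durability, replication and
# capacity planning are handled behind the API rather than by an operator.
s3.put_object(
    Bucket=BUCKET,
    Key="logs/segment-000001.txt",
    Body=b"2021-06-01T12:00:00Z level=info msg=ok\n",
)
obj = s3.get_object(Bucket=BUCKET, Key="logs/segment-000001.txt")
print(obj["Body"].read().decode("utf-8"))
```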
Are You Keeping 30 Days’ Worth of Data?
Enterprises think they are keeping 30 days' worth of log data in their hot storage, but they aren't actually doing so. Most queries are periodically run reports, not interactive sessions with a user sitting at the console. This is especially true at scale, when it is not uncommon to ingest hundreds of megabytes, or even gigabytes, of log data per minute. Interactive workflows in such environments focus on identifying relevant events and data patterns, which are then codified into automated rules that send timely, real-time notifications to administrators. This means that most data does not need to be in hot storage at all; it can be processed in line during ingest or asynchronously at a later point in time.
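A hedged sketch of that inline pattern: evaluate alerting rules as events arrive, then send everything on to the archive instead of a hot database. The rules and notification hook here are hypothetical placeholders:

```python
import re

# Hypothetical alerting rules evaluated during ingest.
ALERT_PATTERNS = [
    re.compile(r"level=error"),
    re.compile(r"payment.*timeout", re.IGNORECASE),
]

def notify(line: str) -> None:
    # Placeholder for a pager, chat or webhook integration.
    print(f"ALERT: {line}")

def ingest(stream):
    """Process log lines inline: raise notifications, then archive everything."""
    batch = []
    for line in stream:
        if any(p.search(line) for p in ALERT_PATTERNS):
            notify(line)
        batch.append(line)  # destined for object storage, not a hot database
    return batch
```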
There's another good reason that companies move data into S3-compatible or other cold storage quickly. Keeping data in the database for a shorter time separates storage from compute, which makes it easier for organizations to scale their storage and recover from crashed clusters. It's dramatically cheaper to store data in cold storage than in a database, and scaling cold storage is easier than scaling a database.
This approach, however, creates a new problem: data now has to be separated into multiple tiers, hot and cold. Moving and managing data between the two tiers requires expertise. Considerations around what to tier, how often to move data and when to hydrate the hot tier with data from the cold tier now become business as usual. The "storage operations tax" just went up.
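With ElasticSearch, for example, that business-as-usual work shows up as index lifecycle management policies that someone has to design, tune and revisit as volumes grow. The phase ages and sizes below are arbitrary placeholders, not recommendations:

```python
import requests

# Illustrative ILM policy: roll over daily, demote to the cold tier after a
# week, delete after 30 days. Every number here is a policy decision someone
# has to own, and it changes as volume grows.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_age": "1d", "max_size": "50gb"}}
            },
            "cold": {
                "min_age": "7d",
                "actions": {"set_priority": {"priority": 0}},
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}

requests.put("http://localhost:9200/_ilm/policy/logs-tiering", json=policy)
```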
What if I Need Long-Term Data Retention?
In highly regulated environments, short-term retention is usually not an option, as businesses must store data, index it and keep it searchable for several years. The same problems exist, albeit at an even larger scale. The choice is between vast amounts of expensive primary storage or a tiered storage architecture. With such requirements, it is not uncommon to see a tiered implementation with most of the data sitting in the cold tier and a significant amount still in the hot tier (e.g., a 30-day window). The "storage operations tax" isn't going anywhere; it only keeps increasing.
Eliminating Legacy Storage Architecture and Data Tiering
Companies use a tiered approach to storage because they fear losing the ability to search data in cold storage. When data in the cold tier does need to be searched, an arduous retrieval process makes accessing the logs slow and challenging, and running real-time searches on older data is impossible. For some application types, this isn't a big deal. Still, for revenue-producing, critical path applications, it's crucial to have quick, real-time access to logs and the ability to get information out of them at a moment's notice. Having multiple data tiers, a "hot" store and a "cold" store, creates cost and management overhead, particularly for Day 2 operations. Moving everything to a hot store would be extremely expensive, so what if you could make cold storage your primary store?
Making S3 Searchable or ‘Zero Storage Operations Tax’
What if we could make S3-compatible storage just as searchable as a database? The reason companies keep their log data in a database is to enable real-time searches. Yet in practice, most organizations are keeping far less historical data in databases than their official retention policies dictate. If an S3-compatible store can be made just as searchable as a database, organizations can dramatically cut down the amount of data stored in databases and the compute resources needed to manage that data. The most recent data, say the last minute's worth, can be stored on disk, but after a minute everything moves to S3. There's no longer a need to run multiple database instances for high availability, because if the cluster goes down, a new one can be spun up and pointed at the same S3-compatible bucket.
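A minimal sketch of that write path, assuming a hypothetical bucket name and a roughly one-minute flush interval:

```python
import gzip
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "example-log-archive"  # hypothetical bucket name
FLUSH_INTERVAL = 60             # seconds of data kept locally before upload

def run(stream):
    """Buffer roughly one minute of log lines locally, then flush to S3.

    If this process dies, a replacement simply points at the same bucket;
    at most the unflushed minute needs to be replayed from the source.
    """
    buffer, started = [], time.time()
    for line in stream:
        buffer.append(line)
        if time.time() - started >= FLUSH_INTERVAL and buffer:
            key = f"logs/{int(started)}-{int(time.time())}.gz"
            s3.put_object(
                Bucket=BUCKET,
                Key=key,
                Body=gzip.compress("\n".join(buffer).encode("utf-8")),
            )
            buffer, started = [], time.time()
```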
Moving log data directly to cold storage while ensuring real-time searchability makes it easier to scale, increases the log data's availability and dramatically decreases both storage and compute costs. When log data is accessed directly in cold storage, users don't have to worry about managing indexes across hot and cold tiers, rehydrating data or building complex tiering policies. It also means companies can actually follow their data retention plans while ensuring developers can access logs and use them to debug critical applications.
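As a simplified sketch of what querying the cold store directly can look like (the bucket, prefix and filter string are illustrative, and a real system would layer smarter pruning over the objects), the search is just a scan over S3 objects with no rehydration step:

```python
import gzip
import boto3

s3 = boto3.client("s3")

def search(bucket: str, prefix: str, needle: str):
    """Scan compressed log objects under a prefix and yield matching lines."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            for line in gzip.decompress(body).decode("utf-8").splitlines():
                if needle in line:
                    yield obj["Key"], line

# Hypothetical usage: find error lines in one day's worth of archived logs.
for key, line in search("example-log-archive", "logs/2021/06/01/", "level=error"):
    print(key, line)
```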