Top Node.js Metrics to Watch

Monitoring Node.js applications poses special challenges. The dynamic nature of the language gives developers plenty of “opportunities” to produce memory leaks, and a single function blocking the event loop can have a huge impact on overall application performance. Parallel execution of jobs is achieved with multiple worker processes via the “cluster” functionality, which takes full advantage of multi-core CPUs – but the master and worker processes belong to a single application, which means they should be monitored together. Let’s take a closer look at the top metrics in Node.js applications to understand why they are so important to monitor.

Note: All images in this post are from Sematext’s SPM Performance Monitoring solution and its Node.js integration.

Garbage Collection & Process Memory – Node.js is based on Google’s Chrome V8 JavaScript engine. Garbage collection reclaims memory used by objects that are no longer required. V8 garbage collection pauses program execution while it runs. Incremental GC cycles (scavenging) process only a part of the heap and are very fast. Full GC cycles deal with objects that survived multiple incremental GC runs; they are executed less frequently to minimize pauses in program execution.

With regard to garbage collection metrics, we should first measure the total time spent on garbage collection. In addition, it is useful to see how often full and incremental GC cycles are executed. The heap size after each GC run can be compared with previous runs to spot a growing trend. That’s why the following metrics should be monitored (a minimal collection sketch follows the list):

  • Time consumed for garbage collection
  • Counters for full GC cycles
  • Counters for incremental GC cycles
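If you want to collect these numbers yourself (or sanity-check what an agent reports), Node.js exposes GC events through the built-in perf_hooks module. Below is a minimal sketch, assuming Node.js 8.5 or newer; note that the field carrying the GC kind moved from entry.kind to entry.detail.kind in newer Node.js versions, so treat this as a starting point rather than a finished agent.

```js
// A minimal sketch: track time spent in GC and count full vs. incremental
// cycles using the built-in perf_hooks module (Node.js 8.5+). Depending on
// the Node.js version, the GC kind lives on entry.kind or entry.detail.kind.
const { PerformanceObserver, constants } = require('perf_hooks');

let gcTimeMs = 0;
let fullGcCount = 0;        // mark-sweep/mark-compact ("full") cycles
let incrementalGcCount = 0; // scavenge ("incremental") cycles

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    gcTimeMs += entry.duration;
    const kind = entry.detail ? entry.detail.kind : entry.kind;
    if (kind === constants.NODE_PERFORMANCE_GC_MAJOR) fullGcCount++;
    if (kind === constants.NODE_PERFORMANCE_GC_MINOR) incrementalGcCount++;
  }
});
obs.observe({ entryTypes: ['gc'] });

// Ship the counters to your monitoring backend periodically.
setInterval(() => {
  console.log({ gcTimeMs, fullGcCount, incrementalGcCount });
}, 10000).unref();
```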


Garbage Collection Metrics

Aside from how often GC happens and how long it takes, we can measure its effect on memory by tracking the following metrics (a short sketch of how to sample them follows the list):

  • Released memory between GC cycles (see above)
  • Process Heap Size and Heap Usage
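Heap size and usage are exposed by built-in APIs, so a quick sanity check requires no agent at all. A minimal sketch using process.memoryUsage() and v8.getHeapStatistics():

```js
// A minimal sketch: sample process and heap memory with built-in APIs.
// process.memoryUsage() reports RSS and V8 heap totals/usage in bytes;
// v8.getHeapStatistics() adds the configured heap size limit.
const v8 = require('v8');

setInterval(() => {
  const { rss, heapTotal, heapUsed } = process.memoryUsage();
  const { heap_size_limit } = v8.getHeapStatistics();
  const mb = (bytes) => (bytes / 1024 / 1024).toFixed(1);
  console.log({
    rssMB: mb(rss),
    heapTotalMB: mb(heapTotal),
    heapUsedMB: mb(heapUsed),
    heapLimitMB: mb(heap_size_limit),
  });
}, 10000).unref();
```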


Process Memory Information

Event Loop – The secret of Node.js’s performance is its asynchronous, non-blocking approach to I/O: the CPU stays highly utilized doing useful work instead of wasting cycles waiting for I/O operations to complete. This means a server can accept many connections without being blocked by pending operations; as soon as an operation finishes, a callback function continues the processing. The implementation is based on a single-threaded event loop: I/O is delegated to the operating system or a worker thread pool, and completion callbacks are queued back onto the event loop. Using synchronous operations drags down performance because other operations have to wait to be executed. That’s why the golden rule for Node.js performance is “don’t block the event loop”.

The metrics to watch here capture the latency to process the next event:

  • Slowest Event Handling (Max Latency)
  • Average Event Loop Latency
  • Fastest Event Handling (Min Latency)


Slowest, average and fastest event processing

A high latency in the event loop might indicate the use of blocking (sync) or time-consuming functions in event handlers, which could impact the performance of the whole Node.js application.
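To get a feel for this metric without an agent, you can measure how late a periodic timer fires compared to when it was scheduled: any extra delay is time the event loop spent busy elsewhere. A minimal sketch (newer Node.js versions also offer perf_hooks.monitorEventLoopDelay() for a histogram-based view):

```js
// A minimal sketch: estimate event loop latency from timer drift and keep
// min/avg/max over a reporting window.
const INTERVAL_MS = 100;
let maxLag = 0, minLag = Infinity, totalLag = 0, samples = 0;

function measure() {
  const scheduledAt = Date.now() + INTERVAL_MS;
  setTimeout(() => {
    const lag = Math.max(0, Date.now() - scheduledAt); // extra delay = loop was busy
    maxLag = Math.max(maxLag, lag);
    minLag = Math.min(minLag, lag);
    totalLag += lag;
    samples++;
    measure();
  }, INTERVAL_MS);
}
measure();

// Report slowest / average / fastest event handling periodically.
setInterval(() => {
  console.log({
    maxLag,
    minLag: samples ? minLag : 0,
    avgLag: samples ? totalLag / samples : 0,
  });
  maxLag = 0; minLag = Infinity; totalLag = 0; samples = 0;
}, 10000).unref();
```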

Cluster Mode and Number of Processes – To scale Node.js beyond the capacity of a single process, master and worker processes are used – the so-called “cluster” mode. The master process shares sockets with the forked worker processes and can exchange messages with them. A typical use case for web servers is forking N worker processes, which operate on the shared server socket and handle requests round-robin (the default since Node v0.12). In many cases N is set to the number of CPUs the server provides – that’s why a constant number of worker processes should be the regular case; if this number changes, it means worker processes have been terminated for some reason. In the case of processing queues, workers might be started on demand. In this scenario it would be normal for the number of workers to change all the time, but it might be interesting to see how long a higher number of workers was active. Using a tool like SPM for Node.js lets one track the number of workers. When picking a monitoring solution or developing your own monitoring for Node.js, make sure it is capable of filtering by hostname and worker ID, and keep in mind that Node.js workers can have a very short lifetime that traditional monitoring tools may not handle well.
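For reference, here is what a typical cluster setup looks like: the master forks one worker per CPU and watches the 'exit' event, which is also the natural place to update a worker-count metric. A minimal sketch:

```js
// A minimal sketch: one worker per CPU, with the worker count tracked on
// every 'exit' and dead workers replaced so the count stays constant.
const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {
  const numWorkers = os.cpus().length;
  for (let i = 0; i < numWorkers; i++) cluster.fork();

  cluster.on('exit', (worker, code, signal) => {
    console.log(`worker ${worker.id} (pid ${worker.process.pid}) died: ${signal || code}`);
    console.log(`workers alive: ${Object.keys(cluster.workers).length}`); // the metric to watch
    cluster.fork(); // replace it to keep the expected worker count
  });
} else {
  // Worker: handle requests on the shared server socket.
  http.createServer((req, res) => {
    res.end(`handled by worker ${cluster.worker.id}\n`);
  }).listen(3000);
}
```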


Worker Count

For example, to compare event loop latency across different Node.js sub-processes, we need to be able to select the workers we want to compare:


Event Loop Latency for each Worker

Web Frameworks – There is a steadily growing number of frameworks for building web services with Node.js; among the most popular are Express, Hapi.js, Restify, Mean.io, and Meteor. When doing HTTP monitoring, here are some of the key metrics to pay attention to (a minimal middleware sketch follows the list):

  • Response time (http/https)
  • Request rate
  • Error rates (total, error categories)
  • Request/Response content size
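As a reference point, most of these can be captured with a few lines of middleware. The sketch below assumes Express (or any framework with the same (req, res, next) middleware signature); sendMetric() is a hypothetical hook for whatever backend you report to.

```js
// A minimal sketch of HTTP metrics middleware: response time, status/error
// flag, and response size per request. sendMetric() is a hypothetical
// reporting hook; wire it to your monitoring backend of choice.
function httpMetrics(sendMetric) {
  return function (req, res, next) {
    const start = process.hrtime();
    res.on('finish', () => {
      const [s, ns] = process.hrtime(start);
      sendMetric({
        method: req.method,
        url: req.originalUrl || req.url,
        statusCode: res.statusCode,
        isError: res.statusCode >= 400,
        responseTimeMs: s * 1e3 + ns / 1e6,
        responseBytes: Number(res.getHeader('content-length')) || 0, // may be 0 for chunked responses
      });
    });
    next();
  };
}

// Usage with Express: app.use(httpMetrics(metric => console.log(metric)));
```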


HTTP/HTTPS Metrics Overview

Of course, Node.js apps don’t run in a vacuum. They connect to other services, other types of applications, caches, data stores, etc. As such, while it is important to know the key Node.js metrics, monitoring Node.js alone – or separately from the rest of the infrastructure – is not the best practice. If there is one piece of advice I can give to anyone looking into (Node.js) monitoring it is this: when you buy a monitoring solution – or if you are building it for your own use – make sure you end up with a solution that is capable of showing you the big picture. For example, Node.js is often used with Elasticsearch (see the Top 10 Elasticsearch Metrics to Watch post), Redis, etc. Seeing metrics for all the systems that surround Node.js apps is invaluable. Here is just a small example of a dashboard showing a few Node.js and Elasticsearch metrics together.


Combined Dashboard of Node.js and Elasticsearch Metrics

So, those are our top Node.js metrics — what are YOUR top 10 metrics? We’d love to know so we can compare and contrast them with ours in a future post.  Please leave a comment, or send them to us via email or hit us on Twitter: @sematext.

And… if you’d like to try SPM to monitor Node.js yourself, check out the free 30-day trial by registering here. There’s no commitment and no credit card required. Small startups, startups with little or no outside funding, non-profits, and educational institutions get special pricing – just get in touch with us.

[Note: this post originally appeared on Radar.com]

Death to APM and Logging Silos

Sematext has combined the power of SPM and Logsene in a single pane of glass – a unified view into all the key bits of operational intelligence every DevOps engineer needs: server and application performance metrics, logs, events, anomalies, alerts, ChatOps integrations, etc.  In other words, the whole is greater than the sum of its parts.

Metrics + Logs Correlation using SPM and Logsene Together in One UI

This video demonstrates how the SPM + Logsene combination solves the problems of having too much data to manage yourself and the disconnect when metrics and logs are siloed.  We address two of the most common problems — and their solutions — below the video.

Problem 1 – Big Data, Big Burden: Servers, containers, apps, and devices spew out more and more data: more metrics, more logs, more events. Collecting and storing all this data is a challenge and is rarely cheap – both in the time invested in building, maintaining, and scaling large-scale data collection, storage, and retrieval systems, and in providing the infrastructure to run them.

Solution: Focus on your organization’s core business, your core strength, and outsource needlessly painful or expensive parts to those who specialize in them.  We already outsource all the time, except we don’t call it “outsourcing”: we buy food, we don’t grow or raise it.  We buy cars and don’t build them.  Most of us don’t buy physical servers any more.  Why?  Because others do that better, faster, cheaper.

Problem 2 – Metrics vs. Log Silos: Collecting and visualizing performance metrics and getting alerts when things go awry is great, but performance charts can tell us only so much.  Code instrumentation, like SPM’s Transaction Tracing, goes deeper and provides more insight, but still doesn’t tell us the whole story.  Similarly, collecting logs and being able to search them is very valuable.  Unfortunately, oftentimes APM and log management solutions live in separate silos that don’t really talk to each other.

Solution: Don’t waste your time jumping between multiple disconnected solutions, be they open-source or commercial. Time is the most precious thing each of us has, and our time as DevOps engineers is very expensive. Use a tool or service that gives you access to as many of the bits of operational information you need as possible. Not only is this more efficient, and thus cheaper, it’s also much more pleasant than jumping between browser and terminal, top, vmstat, dstat, less, grep, etc., which is needlessly manual and gets tedious.

Troubleshooting Doesn’t Need To Take Over Your Life

Troubleshooting production performance issues, dealing with APM alerts and even looking at logs (don’t even think about grepping!) isn’t that hard or time consuming.  Well, as long as you have the right tools, that is.


Cloud & On Premises Deployments

Unlike most monitoring and logging solutions, Sematext offers both Cloud and On Premises deployments for SPM and Logsene.  We’re happy to discuss package pricing if you’d like to combine both products.

Got ideas how we could make metrics and logs correlation more useful for you?  Let us know via comments, email or @sematext.

Not using SPM and/or Logsene yet? Check out the free 30-day trial by registering here (ping us if you’re a startup, a non-profit, or educational institution – we’ve got special pricing for you!).  There’s no commitment and no credit card required.

Presentation: Running High Performance and Fault Tolerant Elasticsearch Clusters on Docker

Running Elasticsearch clusters on Docker? Thinking about it?  If “yes” then we’ve got a presentation for you that digs deep into the details.

(Note: we’ve also got a related blog post about monitoring the official Elasticsearch image on Docker that you might find useful)

Fresh from DevOps Days in Warsaw and delivered by Sematext engineer Rafal Kuć, “Running High Performance and Fault Tolerant Elasticsearch Clusters on Docker” is chock full of practical information that will no doubt answer many of the questions you might have about this process.

Presentation Topics

Some of the topics Rafal covers include:

  • Containers vs. Virtual Machines
  • Running the official Elasticsearch container
  • Container constraints
  • Good network practices
  • Dealing with storage
  • Data-only Docker volumes
  • Scaling, time-based data
  • Multiple tiers and tenants
  • Indexing with and without routing
  • Querying with and without routing
  • Routing vs. no routing
  • Monitoring

 

Here’s a Taste of What You’ll See

How do Containers stack up versus Virtual Machines? There are a lot of elements at play…


Elasticsearch “One-stop Shop”

Sematext is your “one-stop shop” for all things Elasticsearch: Expert Consulting, Production Support, Elasticsearch Training, and Elasticsearch Monitoring with SPM.

Docker Monitoring

Speaking of monitoring…SPM does both Docker monitoring in a sweet little container and Elasticsearch monitoring (and provides alerting and anomaly detection, too), along with many other integrations that DevOps folks find useful.

Enjoy!

Introducing Logsene Live Tail

We’ve been hard at work on our centralized logging SaaS / On-Premises solution – Logsene – and we’re confident the logging fans among us will enjoy the new Live Tail functionality.

Live Tail Benefits

Logsene Live Tail has several important benefits, including:

  • shows logs in real-time as they arrive into your Logsene app
  • lets you filter logs to only see those you’re most interested in (e.g., eliminate info logs which are high-volume, noisy and don’t contain critical information)
  • useful for deploying new software versions; see new errors right away and quickly go in and fix them


Video

Logsene Live Tail is best seen, not told.  Check out this short video for a detailed look!

We hope you like this new addition to Logsene.  Got ideas how we could make it more useful for you?  Let us know via comments, email or @sematext.

Not using Logsene yet? Check out the free 30-day trial by registering here (ping us if you’re a startup, a non-profit, or educational institution – we’ve got special pricing for you!).  There’s no commitment and no credit card required.  Even better — combine Logsene with SPM to make the integration of performance metrics, logs, events and anomalies more robust for those looking for a single pane of glass.

 

Introducing Top Database Operations

If you run Elasticsearch, Solr, or any backend you communicate with using SQL (via JDBC), like SparkSQL, Apache Cassandra (CQL), Apache Impala, Apache Drill, MySQL, PostgreSQL, etc., you’ll like what we’ve just added to SPM.  We call it Database Operations and in SPM you can find it in the new Database report:

If you didn’t watch the video, here’s what Database Operations gives you:

  • Top 5 operation types across all your data stores or filtered to a specific data store type
  • Top 5 operation types by speed, throughput, or simply their volume
  • Time-series reports for volume, throughput, and latency broken down by operation type
  • Ability to view all collected operations, not just the slowest ones, filtered by database type or operation type and sorted by average duration, total duration, or throughput
  • Sparklines that show last 5 minute values and trends
  • Top 10 slowest individual operations and drill-in details

There is also integration with Transaction Tracing, so you can correlate slow data store operations with the actual transaction/request that triggered them.

Important:

  • To get this information, add the SPM agent to the application that is talking to a data store (e.g. Solr or Elasticsearch or MySQL or …). This is because the SPM agent captures operations at that client layer, not in the server itself.
  • To start capturing this information, enable Transaction Tracing in your SPM agents.

This, including Distributed Transaction Tracing, works for all Java applications.


Don’t forget – when you enable Database Operations you will also automatically get Transaction Tracing, as well as the cool AppMaps – enjoy! 🙂

Got ideas how we could make Database Operations better and more useful to you?  Let us know via comments, email or @sematext.

Grab a free 30-day SPM trial by registering here (ping us if you’re a startup, a non-profit, or educational institution – we’ve got special pricing for you!).  There’s no commitment and no credit card required.

Kafka Real-time Stream Multi-topic Catch up Trick

Half of the world, Sematext included, seems to be using Kafka.

Kafka is the spinal cord that connects various components in SPM, Site Search Analytics, and Logsene.  If Kafka breaks, we’re in trouble (but we have anomaly detection all over the place to catch issues early).  In many Kafka deployments, ours included, the most recent data is the most valuable.  Consider the case of Kafka in SPM, which processes massive amounts of performance metrics for monitoring applications and servers.  Clearly, in a performance monitoring system you primarily care about current performance numbers.  Thus, if SPM’s Kafka pipeline were to break and we restore it, what we’d really like to avoid is processing all data sequentially, oldest to newest.  What we’d prefer is processing new metrics data first and then processing older data using any spare capacity we have in order to “fill the gap” caused by Kafka downtime.

Here’s a very quick “video” that shows this in action:

Kafka Catch Up

 

How does this work?

We asked about it back in 2013, but didn’t really get good tips.  Shortly after that we implemented the following logic that’s been working well for us, as you can see in the animation above.

The catch-up logic assumes there are multiple topics to consume from, one of which is the “active” topic that the Producer is publishing messages to. The Consumer sets which topic is active, although the Producer can also set it if it has not already been set. The active topic is stored in ZooKeeper.

The Consumer measures the lag by looking at the timestamp that the Producer adds to each message published to Kafka. If the lag is over N minutes, the Consumer starts paying attention to the offset lag (how far the consumed offset trails the latest produced offset). If that lag starts getting smaller and keeps getting smaller M times in a row, the Consumer knows it is able to keep up (i.e. the lag is not getting bigger) and sets another topic as active. This signals the Producer to switch publishing to this new topic, while the Consumer keeps consuming from all topics.

As a result, the Consumer is able to consume both new data and the delayed/old data, and we avoid having no fresh data while we are in catch-up mode busy processing the backlog. Consuming from the newly active topic is what causes new data to be processed (this corresponds to the right-most part of the chart above “moving forward”), and consuming from the other topic is where we get the data that fills in the gap.
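For the curious, here is the decision logic condensed into a short JavaScript-flavored sketch. It is pseudocode rather than our actual implementation: getConsumerLagMs(), getOffsetLag(), pickFreshTopic(), and setActiveTopic() are hypothetical helpers standing in for your Kafka client and ZooKeeper access, and N_MINUTES and M are the tuning knobs mentioned above.

```js
// A sketch of the catch-up logic, not a drop-in implementation.
const N_MINUTES = 5; // lag threshold before we start watching the offset lag
const M = 3;         // consecutive shrinking-lag checks required to switch topics

let watchingOffsets = false;
let shrinkingStreak = 0;
let lastOffsetLag = Infinity;

async function checkCatchUp() {
  const lagMs = await getConsumerLagMs(); // now minus producer timestamp of the last consumed message
  if (!watchingOffsets && lagMs > N_MINUTES * 60 * 1000) {
    watchingOffsets = true;               // we are behind: start tracking the offset lag
  }
  if (watchingOffsets) {
    const offsetLag = await getOffsetLag(); // latest produced offset minus consumed offset
    shrinkingStreak = offsetLag < lastOffsetLag ? shrinkingStreak + 1 : 0;
    lastOffsetLag = offsetLag;
    if (shrinkingStreak >= M) {
      await setActiveTopic(pickFreshTopic()); // recorded in ZooKeeper; the Producer switches,
      watchingOffsets = false;                // while the Consumer keeps reading all topics
      shrinkingStreak = 0;
      lastOffsetLag = Infinity;
    }
  }
}
setInterval(checkCatchUp, 60 * 1000);
```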

If you run Kafka and want a good monitoring tool for Kafka, check out SPM for Kafka monitoring.

 

Docker + Elasticsearch: How to Monitor the Official Elasticsearch Image on Docker

Update: Elasticsearch 2.x requires a setup with standalone monitors. Why? A very restrictive Java security policy was introduced in Elasticsearch 2.x. This security policy forbids loading of 3rd party libraries – including in-process monitoring libraries from SPM. Here is an example with Docker Compose and the SPM Client standalone monitor working with Elasticsearch 2.x.

The official Elasticsearch Image on Docker Hub has already generated more than 1.6 million pulls. It is probably the easiest way to add Elasticsearch to the development setup of your application stack. The reason for this crazy number? A rapidly growing number of organizations are using Elasticsearch and Docker in production. Needless to say, monitoring Elasticsearch is essential in production, and you can find a detailed analysis of this topic (including the “top 10 Elasticsearch metrics to watch”) in the free eBook: Elasticsearch Monitoring Essentials. Docker is disruptive in many ways, and there are many things that are slightly different and worth mentioning.  These include:

  1. Changed deployment for Elasticsearch and its monitoring tools using Dockerfile, Docker Compose or various Orchestration Tools
  2. There is a new Layer to monitor: Container Metrics and Events, see: Docker Events and Metrics monitoring and SPM for Docker
  3. Logging has changed: containers log to the console, and logs need to be retrieved from the Docker daemon instead of from the Elasticsearch log file.  Check out our post on the subject: Innovative Docker Log Management
  4. Official Images may not provide options for monitoring (such as JMX).  However, the official Image for Elasticsearch provides an option to pass parameters to the Java Runtime Environment, and we will use this option for Elasticsearch monitoring in this post. You should also be aware that the official Elasticsearch Image does not include any plugins, and commercial monitoring from Elastic can’t be distributed in this Image for licensing reasons.  Our monitoring tool of choice is SPM.  If you are not familiar with SPM – but have heard of it – or if you use Marvel, have a look at Marvel vs. SPM.

Next, I’m going to demonstrate a setup to monitor multiple Elasticsearch nodes on a single Docker Host. The final setup will provide the full Monitoring and Logging package:

  • Detailed Application Metrics for Elasticsearch, deployed on Docker
  • Detailed Container Metrics and Docker Events  
  • Centralized Logs for all Containers by SPM for Docker

So let’s first decide on one of the following options to monitor Elasticsearch on Docker.  You can:

  1. Build your own Elasticsearch container with the included monitoring components. I’m not going to go into details about this option today; rather, I’m going to focus on the official / trusted build.
  2. Use a standalone agent, which queries metrics from the Elasticsearch container. This requires a setup for JMX and Docker networking configurations for the monitor and Elasticsearch. The metrics, gathered by remote agents, are limited and, in the Docker context, running an external monitoring process plus Elasticsearch processes consumes more resources.  And the next option …
  3. Inject an SPM in-process monitoring agent into Elasticsearch. This option has the lowest resource usage and has support for advanced monitoring functions like Transaction Tracing and AppMap.

I chose to implement Option #3 in this blog post because it provides the best insights into Elasticsearch. This means the Elasticsearch container needs file-system access to the SPM monitoring agent. Sematext provides the SPM Client (which includes the monitoring agent and metrics sender) pre-installed in a Docker Image, referred to as the “SPM Client Image/Container” in the following instructions and published on Docker Hub as “sematext/spm-client”.  The main trick here is to mount a volume from the SPM Client Container into the Elasticsearch Containers in order to load the monitoring library.

Let’s have a look at the desired setup and how to get there:


Monitoring Setup for Elasticsearch on Docker

Continue reading “Docker + Elasticsearch: How to Monitor the Official Elasticsearch Image on Docker”

Elasticsearch “Big Picture” – A Creative Flow Chart and Poster

There are many ways to look at Elasticsearch, but here at Sematext we’re pretty confident that you haven’t seen anything like this flowchart to demonstrate how it works:


Download a copy and print your own Elasticsearch poster!

If you’re looking for something unique to show off your Elasticsearch chops, then download a copy today and print your own.  We have files for US letter, A4, Ledger (11”x17”) and Poster (24”x36”) sizes.


 

Sematext is your “one-stop shop” for all things Elasticsearch: Expert Consulting, Production Support, Elasticsearch Training, Elasticsearch Monitoring, even Hosted ELK!

Doing Centralized Logging with ELK?  We Can Help There, Too

If your log analysis and management leave something to be desired, then we’ve got you covered there as well.  There’s our centralized logging solution, Logsene, which you can think of as your “Managed ELK Stack in the Cloud.”  It’s also available as an On Premises deployment.  Lastly, we offer Logging Consulting should you require more in-depth support.

Questions or Feedback?

If you have any questions or feedback for us, please contact us by email or hit us on Twitter.

Presentation: Large Scale Log Analytics with Solr

In this presentation from Lucene/Solr Revolution 2015, Sematext engineers — and Solr and centralized logging experts — Radu Gheorghe and Rafal Kuć talk about searching and analyzing time-based data at scale.

Documents ranging from blog posts and social media to application logs and metrics generated by smartwatches and other “smart” things share a similar pattern: they carry timestamps among their fields, they rarely change, and they are deleted once they become obsolete. Because this kind of data is so voluminous, it often causes scaling and performance challenges.

In this talk, Radu and Rafal focus on these challenges, including: properly designing collections architecture, indexing data fast and without documents waiting in queues for processing, being able to run queries that include time-based sorting and faceting on enormous amounts of indexed data (without killing Solr!), and many more.

Here is the video:

…and here are the slides:

 

Here’s a Taste of What You’ll See

How do Logstash, rsyslog, Redis, and fast-food-hating zombies (?!) relate? You’ll have to check out the presentation to find out…


Solr “One-stop Shop”

Sematext is your “one-stop shop” for all things Solr: Expert Consulting, Production Support, Solr Training, and Solr Monitoring with SPM.

Log Analytics – We Can Help

If your log analysis and management leave something to be desired, then we’ve got you covered there as well.  There’s our centralized logging solution, Logsene.  And we also offer Logging Consulting should you require more in-depth support.

Questions or Feedback?

If you have any questions or feedback for us, please contact us by email or hit us on Twitter.  We love talking Solr — and logs!

 

Presentation: Log Analysis with Elasticsearch

Fresh from the Velocity NYC conference is the latest presentation from Sematext engineers Rafal Kuć and Radu Gheorghe: “From zero to production hero: Log Analysis with Elasticsearch.”

The talk goes through the basics of centralizing logs in Elasticsearch and all the strategies that make it scale with billions of documents in production. They cover:

  • Time-based indices and index templates to efficiently slice your data (a minimal template sketch follows this list)
  • Different node tiers to de-couple reading from writing, heavy traffic from low traffic
  • Tuning various Elasticsearch and OS settings to maximize throughput and search performance
  • Configuring tools such as logstash and rsyslog to maximize throughput and minimize overhead
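To give a flavor of the first point, here is a minimal sketch of a time-based index template, so that daily indices such as logs-2015.11.20 automatically pick up consistent settings and mappings. It assumes the legacy “elasticsearch” npm client and Elasticsearch 1.x/2.x-era template syntax; adjust the host, shard counts, and mappings to your own cluster.

```js
// A minimal sketch, assuming the legacy "elasticsearch" npm client and
// Elasticsearch 1.x/2.x template syntax.
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

client.indices.putTemplate({
  name: 'logs-template',
  body: {
    template: 'logs-*',               // applied to every daily logs-* index
    settings: {
      number_of_shards: 4,
      number_of_replicas: 1,
      'index.refresh_interval': '5s', // trade a little freshness for indexing throughput
    },
    mappings: {
      log: {
        properties: {
          '@timestamp': { type: 'date' },
          message: { type: 'string' },
          severity: { type: 'string', index: 'not_analyzed' },
        },
      },
    },
  },
}).then(() => console.log('logs template created'));
```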

Here is part 1 of the Video:

Here is part 2 of the Video:

Here are the slides:

 

And here are the Commands and Demo used in the presentation: https://github.com/sematext/velocity

Continue reading “Presentation: Log Analysis with Elasticsearch”