Presentation: Top Node.js Metrics to Watch

Fresh from Germany’s largest Node.js Meetup, hosted by Wikimedia in Berlin, is the latest presentation from Sematext DevOps Evangelist Stefan Thies — “Top Node.js Metrics To Watch”. The event was shared with the Node.js Meetup in London via live video stream; the full recording is available on YouTube.

Stefan’s talk walks through the challenges of developing Node.js monitoring solutions, such as SPM for Node.js, and covers key metrics, including examples from monitoring Kibana’s Node.js server.

Here is the video:

and the slides:

Please note that the article “Top Node.js Metrics to Watch” was originally published by O’Reilly Radar.

Have a look at our other Node.js posts — there is a lot more interesting material to discover, like MongoDB monitoring with Node.js.

Questions or Feedback?
If you have any questions or feedback for us, please contact us by email or hit us on Twitter.  We love talking about performance monitoring — and log management!

 

Docker Swarm: Collecting Metrics, Events & Logs

Docker Swarm is a cluster manager for Docker.  When accessed via the Docker API by Docker API Clients or Docker command line tools, a Docker Swarm cluster looks just like a single Docker Host.  Docker Swarm distributes containers to multiple nodes using various deployment strategies in the cluster scheduler.

Given that a Swarm cluster looks like a single Docker Host from the API point of view, it should be very easy to monitor Docker Swarm with existing Docker monitoring tools!  Connecting a monitoring agent to the Swarm Master API endpoint should do the job, right? The Sematext Docker Agent could simply collect all container metrics, events and logs from the Swarm Master – it should be a piece of cake. Hm, but could there be a gotcha?  It turns out there is more than one:

  • If we deploy a single monitoring agent to the master node, it would miss host metrics for all other nodes because the Docker API doesn’t provide any host metrics. Nor could we see how much memory, disk space or CPU the Docker Swarm nodes themselves consume. Solution: deploy a monitoring agent to each node to collect the metrics locally.
  • Assuming a larger cluster with a high volume of logs, events and metrics to collect, a single monitoring agent connected to the master node would need to handle all operational data of the cluster.  This would work for a small cluster, but such an architecture would obviously be destined for failure on larger clusters.  Guess what the solution is? It’s much better to have an agent running on each node and distribute the monitoring and logging work over all nodes. If you do it right from the beginning, there is no need to change the deployment strategy later, when the cluster scales out.
Monitoring container running on each Docker node

In the following example we assume that the master and agent nodes have the UNIX socket enabled in the Docker daemon settings. This can be achieved by passing --engine-env 'DOCKER_OPTS="-H unix:///var/run/docker.sock"' to the docker-machine create command. Use this GitHub Gist to create a Docker Swarm cluster with UNIX sockets enabled. Later, we will see how this helps simplify the deployment of any tool that needs to connect to the local Docker daemon – including monitoring and logging containers.
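For reference, here is a minimal sketch of what such a docker-machine create call might look like; the driver, discovery token and machine name are placeholders, and the Gist linked above covers the complete cluster setup:

# Hypothetical sketch: create a Swarm master with the Docker UNIX socket enabled
docker-machine create -d virtualbox \
  --swarm --swarm-master \
  --swarm-discovery token://<cluster-token> \
  --engine-env 'DOCKER_OPTS="-H unix:///var/run/docker.sock"' \
  swarm-master
# Agent nodes are created the same way, just without the --swarm-master flag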

Let’s see how to deploy the Sematext Agent to each node in a Docker Swarm cluster with the UNIX socket enabled in the Docker daemon, as just described.

When we started to work on Swarm Monitoring our first question was “Does Docker Swarm provide a deployment strategy for running exactly one instance of a service on each node?” We checked the documentation, but no dice.  We found strategies like “spread, binpack, and random” (see https://docs.docker.com/swarm/scheduler/strategy/), but none of them would guarantee exactly one instance of a service on each node. The “spread” strategy spreads the containers evenly over all hosts. The “binpack” strategy fills up one node after another with containers, while “random” spreads containers randomly to nodes. There was seemingly no strategy suitable for monitoring services running only once on each node.

So how can we distribute the monitoring container to each host using Docker Swarm instead of bash script iterating over all nodes?  It turns out it’s possible to define an affinity to ensure that containers that should run on the same host are scheduled together. In our case we use “anti-affinity” in the deployment strategy, which instructs Swarm not to deploy the container with Sematext Agent to hosts that already have that container running. In other words, it tells Docker Swarm to run no more than one Sematext Agent container on each Docker host.  To do that we define a docker-compose.yml file with the “anti-affinity” specified in the container environment section:

sematext-agent:
  image: 'sematext/sematext-agent-docker:latest'
  environment:
    - LOGSENE_TOKEN=3b549a2c-653a-4832-xxx
    - SPM_TOKEN=fe31fc3a-4660-47c6-xxx
    - affinity:container!=sematext-agent* 
  privileged: true
  restart: always
  volumes:
    - '/var/run/docker.sock:/var/run/docker.sock'

Finally, we use the docker-compose command to scale out the Sematext Docker Agent and deploy it to all Swarm cluster nodes.  To do that we run:

eval $(docker-machine env swarm-master --swarm)
docker-compose up -d 
# scale the agent to the number of running Swarm nodes
docker-compose scale sematext-agent=$(docker-machine ls | grep swarm | grep Running | wc -l)

After running the above commands, Sematext Docker Agent will be running on each node and within a minute you will receive Host and Container Metrics for all containers, all their Logs and all Docker events from all nodes in your Docker Swarm cluster.  Complete visibility!

Aggregated Metrics from all Docker Swarm nodes

Please note there are many ways to create a Swarm cluster and you might have another setup, such as:

  • TLS-secured Docker daemon and no possibility to activate the UNIX socket: In this situation you have to deal with the existing Docker daemon setup, which typically uses TLS and authentication via certificates (for example, if you followed Docker’s instructions to create Swarm clusters using Docker-Machine). When the Docker socket is secured with TLS, each client – including the Sematext Docker Agent – needs the certificates for authentication. This involves a bunch of parameters such as “DOCKER_HOST”, “DOCKER_CERT_PATH” and “DOCKER_TLS_VERIFY”, plus mounting the certificates into the container. In addition, we need to know which Docker daemon the agent should connect to (typically port 2375 for TCP, 2376 for TLS on each node, and port 3376 on Swarm Master nodes for the Swarm API). We made this scenario easy with a deployment script for the Sematext Agent that uses the TLS options provided by Docker-Machine; see the sketch after this list.
  • You use CoreOS to run Docker Swarm: In this case you could use fleet and systemd to distribute the agent to each node (simply install Sematext Agent with these instructions)
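For the TLS scenario, a rough sketch of what running the agent on a single node might look like is shown below. The machine name and tokens are placeholders, and the exact certificate-related variables the agent reads are an assumption here – the deployment script mentioned above takes care of these details:

# Hypothetical sketch for a TLS-secured daemon; connection values come from docker-machine
eval $(docker-machine env swarm-node-1)
docker run -d --name sematext-agent \
  -e SPM_TOKEN=YOUR_SPM_TOKEN \
  -e LOGSENE_TOKEN=YOUR_LOGSENE_TOKEN \
  -e DOCKER_HOST=$DOCKER_HOST \
  -e DOCKER_CERT_PATH=/certs \
  -e DOCKER_TLS_VERIFY=$DOCKER_TLS_VERIFY \
  -v $DOCKER_CERT_PATH:/certs \
  sematext/sematext-agent-docker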

The deployment methods above should work for other monitoring tools or logging containers as well, because most such tools need to run on each node to collect the metrics locally.

If you have questions or special needs for monitoring more complex setups, feel free to contact us. The Sematext Docker Agent is a turnkey solution for Docker Logs, Metrics and Events – sign up here and give it a try (30-day free trial, no credit card needed).

Introducing NetMaps

New Year, New Feature in SPM!  We are happy to announce the immediate availability of NetMaps in SPM!  Check out why they are useful or watch the short video below.

Ever wondered how different components of distributed apps are actually connected over the network? When it comes to troubleshooting distributed application stacks like Apache Kafka, Spark, Hadoop, Cassandra, Solr, or Elasticsearch — not to mention microservice architectures or Docker containers — information about the deployed infrastructure becomes critical. That architecture diagram you drew N months ago?  It’s probably out of date.  Apps we run today are often very dynamic. Instances, nodes, and containers come and go, whether because of elastic up/down scaling or other reasons.

Discovering This Dynamic Infrastructure

Watching the actual network traffic on all nodes could quickly answer many questions for DevOps engineers doing troubleshooting or planning setup changes. For example:

  • Which nodes are online and active?
  • How are nodes connected to other nodes?
  • What are the dependencies between network services?
  • What is the consumed bandwidth between nodes?
  • Which applications run on a specific network node?

Visualize Network Connections

Designed to visualize network connections and answer the above questions instantly, NetMaps also include:

  • Automatic Discovery of network nodes and applications
  • Filtering by application and host name
  • Automatic Visualizations as Network Map and Chord Diagrams
  • Interactive Explorer for following network links for each application node
  • Bandwidth consumption for all incoming and outgoing network connections
  • Navigation from the NetMap to all nodes and related performance metrics of the monitored App

The best practice is to activate network monitoring on all application server nodes that communicate with databases, message brokers, search engines, etc. That way it is easy to see how client applications communicate with backend servers.

NetMap “Map” View


NetMap “Chord” View


It is very easy to activate Network Monitoring in the SPM Client, the collector for Host and Application Metrics. Intelligent network filters keep the resource usage of network monitoring low while capturing all relevant packets, so you can explore your infrastructure using the “NetMap” tab in SPM. If you find network maps interesting, you might also be interested in SPM’s AppMap feature for JVM applications, which discovers relationships between monitored JVM applications such as Elasticsearch, Solr, Cassandra, Spark or Kafka.

We hope you like this new addition to SPM.  Got ideas how we could make it more useful for you?  Let us know via comments, email or @sematext.

Not using SPM yet? Check out the free 30-day SPM trial by registering here.  There’s no commitment and no credit card required.

SPM vs. New Relic APM – Performance Monitoring Solution Comparison

If you’ve found your way to this post then chances are high that you’re having second thoughts about diving into a New Relic APM subscription.  You’re not alone.  In fact, we hear from many fellow DevOps engineers looking at performance monitoring solutions who check out New Relic APM — or who are already using it — because it is so widely known, but wonder if there is a better tool, specifically with traits like:

  • Better pricing
  • On Premises deployment (not just SaaS)
  • Integration of metrics, logs and events in a single UI
  • Across-the-board anomaly detection

In all of the above cases — and others — SPM meets or exceeds New Relic APM.  This SPM vs. New Relic APM comparison document has all the details.


So why not give SPM a try?  You can check out a free 30-day trial by registering here.  There’s no commitment and no credit card required.  Even better — combine SPM with Logsene to make the integration of performance metrics, logs, events and anomalies more robust for those looking for a single pane of glass.

MongoDB Monitoring Support

For many of us in the DevOps field, MongoDB is a critical part of our IT stack.  With today’s acquisition of WiredTiger, MongoDB is further establishing itself as the NoSQL DB built to support massive data processing and storage.  It would be an understatement to say that MongoDB does a lot, with many organizations using it as their backend storage framework, analytics backend, and so on.

So your MongoDB cluster really, really needs to be in tip-top shape.  All the time.  And if it’s not then you need to know asap — or better yet — prevent problems before they kick in and make your life difficult.  That’s where SPM comes in — with MongoDB monitoring, alerting and anomaly detection.  MongoDB exposes a boatload of metrics, but instead of just throwing all of them on endless charts, we’ve taken the time to cherry-pick what we think are the top 50 most valuable MongoDB metrics to monitor. We have also made it possible to filter the MongoDB metrics by server, as well as by database and collection where possible.

The key metric groups we track are:

  • Database Operations
  • Database Memory
  • Database Storage
  • Documents
  • Locks
  • Network
  • Database Journal
  • Background Flushes
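If you want to peek at the raw counters behind these groups yourself, most of them come from MongoDB’s serverStatus command; here is a quick, hypothetical sketch using the mongo shell (the field selection is just an illustration of where the groups above come from):

# Hypothetical sketch: a few of the raw serverStatus sections these metric groups map to
mongo --quiet --eval '
  var s = db.serverStatus();
  printjson({
    opcounters: s.opcounters,  // Database Operations
    mem:        s.mem,         // Database Memory
    network:    s.network,     // Network
    locks:      s.locks        // Locks (per lock type since MongoDB 3.x)
  });
'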

The Overview below provides nine charts with key MongoDB metrics:

  • Row 1 displays CPU, Memory and Disk Metrics
  • Row 2 displays Database Operations, Database Memory and Database Storage Metrics.
  • Row 3 adds Collection/Document Metrics, Locks, and wait times; followed by Network Metrics for MongoDB


SPM for MongoDB Overview

If you monitor a MongoDB cluster, the Server tab provides a quick overview of the health of each node:


SPM Server View

The Reports on the left side of the screen below provide detailed information for each group of metrics. Let’s have a quick look at them.


OS Metrics: CPU Metrics, Memory Usage, Disk Space and I/O

Below is an example of some of the key MongoDB Metrics found in SPM:

  • Database Operations: Counters for Queries, Insert, Update, Delete and other commands for the main database plus replica operations
  • Database Memory: Resident, Virtual, Mapped, and Journal Memory
  • Database Storage: Size of Data Files, Namespace Files, DB Files etc., plus Size of Objects, Number of Collections and Objects


MongoDB Storage & Collections

The screenshot below shows:

  • Documents: Counters for Documents inserted, updated or returned by queries
  • Locks: Lock counters and lock acquisition wait times at the Global, Database, Collection and Journal level. Since MongoDB 3.x, locks are not always global; SPM shows a breakdown for all lock types. These metrics are good candidates for anomaly-detection alerts.  Simply add an alert from the menu in the top-left corner of each chart.


Metrics for all MongoDB Locks

Other key MongoDB metrics that SPM displays are:

  • Network: Number of client connections, Received and transmitted data, Request rate
  • Database Journal: Commits, Early Commits,  Commit times and lock times


MongoDB Journal Metrics

If you’d like to see MongoDB metrics together with the top Node.js metrics, you might like the idea of putting MongoDB and Node.js metrics from SPM for Node.js in a custom dashboard:


SPM Custom Dashboard with MongoDB Locks and Node.js Event Loop Latency

We hope you like this new addition to SPM.  Got ideas how we could make it more useful for you?  Let us know via comments, email or @sematext.

Not using SPM yet? Check out the free 30-day trial by registering here.  There’s no commitment and no credit card required.  Even better — combine SPM with Logsene to make the integration of performance metrics, logs, events and anomalies more robust for those looking for a single pane of glass.

Docker + Solr How-to: Monitoring the Official Solr Docker Image

The official Solr Image on Docker Hub was released just a few weeks ago and already has 16K pulls. Why not more? Well, there are more than 200 different Solr images on Docker Hub — probably because, until recently, no official image was available!

A rapidly growing number of organizations are using Solr and Docker in production and they are probably happy about the new official Image. Needless to say, monitoring Solr is essential in production. Docker is disruptive in many ways, and there are many things that are slightly different and worth mentioning.  These include:

  1. Changed deployment for Solr and its monitoring tools using Dockerfile, Docker Compose or various Orchestration Tools
  2. There is a new Layer to monitor: Container Metrics and Events, see: Docker Events and Metrics monitoring and SPM for Docker
  3. Logging has changed: containers log to the console and logs need to be retrieved from the Docker daemon instead of from the Solr log file.  Check out our post on the subject: Innovative Docker Log Management
  4. Official Images may not provide options for monitoring (such as JMX).  However, the official Image for Solr provides an option to pass parameters to the Java Runtime Environment.  We will use this option for Solr monitoring in this post.

Next, I’m going to demonstrate the setup of a Solr node with SPM. The final setup will provide the full Solr & Docker Monitoring and Logging package:

  • Detailed Application Metrics for Solr, deployed on Docker
  • Detailed Container Metrics and Docker Events
  • Centralized Logs for all Containers by SPM for Docker

Let’s first decide on one of the following options to monitor Solr on Docker:

  1. Build your own Solr container with a mix of open-source monitoring/alerting tools. I’m not going to go into detail about this option today because dealing with a mix of open-source DevOps tools and a non-official Solr image doesn’t sound clean; plus, we can do better.
  2. Use a standalone monitoring agent, which queries metrics from the Solr container. This requires a setup for JMX and Docker networking configurations for the monitor and Solr. The metrics gathered by remote agents are limited and, in the Docker context, running an external monitoring process plus Solr processes consumes more resources.  And the next option …
  3. Inject an SPM in-process monitoring agent into Solr. This option has the lowest resource usage and has support for advanced monitoring functions like Transaction Tracing and AppMap.

We’ll go with Option #3 in this blog post, as it provides the best insights into Solr.  Sematext provides the SPM Client (this includes the monitoring agent and metrics sender) pre-installed in a Docker Image.  We refer to this dockerized SPM Client as “SPM Client Image/Container” in the following instructions.  The main trick here is to mount a volume from SPM Client Container into Solr Containers in order to load the monitoring library that’s part of the SPM Client Container.

Let’s have a look at the desired setup and how to get there:

Monitoring Setup

We’ll use the latest Docker Compose version (> v1.5) because we can then use environment variable substitution in Docker Compose.

1) Configure and start SPM-Client Container

The SPM Token is a unique identifier for monitored applications – if you haven’t created an SPM App for Solr, then create one here first. Should take about 37 seconds.

# Set the SPM Token as Environment Variable
export SPM_TOKEN=4feb144c-4da8-4081-83b5-b0b8e06e743a
# Set the JVM Name, which appears in SPM JVM Metrics Report
# In addition we will use it as Hostname for the Solr container
export JVM_NAME=SOLR1

2) Create the SPM Client and Solr services in docker-compose.yml. Note: you may copy this file to make changes for additional Solr options; all parameters are set as Environment Variables.

spm-client-solr:
 image: sematext/spm-client
 container_name: spm-client-solr
 hostname: spm-client-solr
 environment:
 - SPM_CONFIG=${SPM_TOKEN} solr javaagent ${JVM_NAME}

SOLR1:
 image: solr
 hostname: solr1
 ports:
 - "8983:8983"
 volumes_from:
 - spm-client-solr
 environment:
 - SOLR_OPTS=-Dcom.sun.management.jmxremote -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-solr.jar=${SPM_TOKEN}::${JVM_NAME}
 command: bin/solr -f

In the Environment variable “SOLR_OPTS” in the Docker-Compose file above we see options for the SPM in-process monitor to inject a .jar file from the SPM Client Volume.  The SOLR_OPTS string is taken from SPM install instructions.  It includes the SPM Token (the ${SPM_TOKEN} part) and provides the JVM name so we can distinguish between multiple Solr instances if we run N of them on the same host (the ::${JVM_NAME} part).

3) Run Solr and SPM Monitor  

We are now ready to fire up Solr:

    docker-compose up -d


All done! After about a minute, metrics for the Docker Host, JVMs and Solr nodes will appear in SPM.  Because we chose consistent naming for the container hostname and JVM name, we can immediately see, in every chart, the relevant filters named “SOLR1”.  This is much better than some random Container IDs.


Solr Metrics Overview

But what about my Solr Logs and the Container Metrics?

Simply run SPM for Docker – it collects logs as well as container and host metrics.  It can also parse Solr logs and store them in Logsene (see Logsene 1-Click ELK Stack), which is awesome because it means you can have both Solr/OS/JVM metrics AND Solr logs all in one place!  Or do you prefer to ssh to your servers and grep log files?

Docker Logs & Metrics Steps:

First we create the SPM App with the type “Docker” for Docker-specific metrics and then we create a Logsene App for our logs. Then we use the generated App Tokens to run Sematext Agent for Docker.

docker run -d --name sematext-agent -e SPM_TOKEN=SPM_DOCKER_APP_TOKEN -e LOGSENE_TOKEN=LOGSENE_APP_TOKEN -v /var/run/docker.sock:/var/run/docker.sock sematext/sematext-agent-docker

After a few minutes, you will get Host and Container Metrics together with Events and Logs in SPM, as shown here:


Please note that logs from the containers are automatically shipped and parsed! No setup for log shippers? That is correct — there is NO complicated setup of syslog, Logstash, Docker log drivers, etc.  All this work is done by SPM for Docker. For example, each log line has a “node_name” field for the Solr node. It takes the timestamp, severity, class, thread and source from the Solr log and each log is automatically tagged with the container ID and image name. Moving from SPM Metrics to detailed Solr Logs including Exceptions and parsed Stack Traces is just another mouse click away! Look:

Multi-Line Exception, captured and parsed from Solr container

 


The filters next to field stats on the right side of the screen make it easy to identify containers with the most logs by choosing “container_name”.  That’s just a little detail in the Logsene UI – feel free to explore it by creating Alerts or Kibana 4 Dashboards for your container logs.

Like what you saw here? To monitor Docker and Solr with SPM just get a free account here!  And drop us an email or hit us on Twitter with suggestions, questions or comments.  Solr and Docker are topics we enjoy chatting about with the community!

Top Node.js Metrics to Watch

Monitoring Node.js applications presents special challenges. The dynamic nature of the language provides many “opportunities” for developers to produce memory leaks, and a single function blocking the event queue can have a huge impact on the overall application performance. Parallel execution of jobs is achieved with multiple worker processes via the “cluster” functionality to take full advantage of multi-core CPUs – but the master and worker processes belong to a single application, which means that they should be monitored together. Let’s have a deep look at the top metrics in Node.js applications to get a better understanding of why they are so important to monitor.

Note: All images in this post are from Sematext’s SPM Performance Monitoring solution and its Node.js integration.

Garbage Collection & Process Memory – Node.js is based on Google’s Chrome V8 JavaScript engine. Garbage Collection reclaims memory used by objects that are no longer required. V8 garbage collection pauses program execution. Incremental GC cycles (scavenging) process only a part of the Heap and are very fast. Full GC cycles deal with objects that survived multiple Incremental GC runs. Full GC runs are executed less frequently to minimize pauses in the program execution.

With regard to garbage collection metrics, we should first measure the total time spent on garbage collection. In addition, it is useful to see how often a full GC cycle — or incremental GC cycle — is executed. The size of the heap can be compared with its size after the last GC run to see if there is a growing trend. That’s why the following metrics should be monitored:

  • Time consumed for garbage collection
  • Counters for full GC cycles
  • Counters for incremental GC cycles
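If you want to see these GC cycles without any agent attached, V8’s built-in GC tracing is a quick way to get a feel for them; this is just a local sketch (app.js is a placeholder for your entry point), not how SPM collects the metrics:

# Print every scavenge (incremental) and mark-sweep (full) GC run with its duration
node --trace_gc app.js
# Optionally cap the old generation (in MB) so memory leaks surface sooner
node --max_old_space_size=256 --trace_gc app.js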


Garbage Collection Metrics

Aside from how often GC happens and how long it takes, we can measure the effect on memory by providing the following metrics:

  • Released memory between GC cycles (see above)
  • Process Heap Size and Heap Usage
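The raw numbers behind heap size and usage are also exposed by Node.js itself, which is handy for a quick sanity check (the output values shown are made up):

# heapTotal / heapUsed are the process heap size and heap usage; rss is the resident set size
node -e 'console.log(process.memoryUsage())'
# e.g. { rss: 20054016, heapTotal: 9751808, heapUsed: 4106352 }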


Process Memory Information

Event Loop – The secret of Node.js’s performance is its ability to keep the CPU busy with asynchronous operations; that way the CPU can be highly utilized and doesn’t waste cycles waiting for I/O operations. This means a server can handle many connections without being blocked by I/O. As soon as an operation is finished, callback functions are used to continue processing.  The implementation is based on a single event loop, which dispatches async operations and processes their callbacks as they complete. Using synchronous operations drags down performance because other operations need to wait to be executed.  That’s why the golden rule for Node.js performance is “don’t block the event loop”.

The metric to watch is the Latency to process the next event:

  • Slowest Event Handling (Max Latency)
  • Average Event Loop Latency
  • Fastest Event Handling (Min Latency)


Slowest, average and fastest event processing

A high latency in the event loop might indicate the use of blocking (sync) or time-consuming functions in event handlers, which could impact the performance of the whole Node.js application.
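As a rough, dependency-free illustration of what event loop latency means, you can schedule a timer and measure how late it fires; anything consistently above a few milliseconds points to blocking work on the loop (this is just a sketch, not how SPM measures it):

# A 500 ms timer should fire every ~500 ms; the difference is the event loop lag
node -e '
var last = Date.now();
setInterval(function () {
  var now = Date.now();
  console.log("event loop lag: " + (now - last - 500) + " ms");
  last = now;
}, 500);
'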

Cluster Mode and number of processes – To scale Node.js beyond the capacity of a single process, the use of master and worker processes is required – the so-called “cluster” mode. The master process shares sockets with the forked worker processes and can exchange messages with them. A typical use case for web servers is forking N worker processes, which operate on the shared server socket and handle the requests in round robin (since Node v0.12). In many cases programs choose N to match the number of CPUs the server provides – that’s why a constant number of worker processes should be the regular case.  If this number changes, it means worker processes have been terminated for some reason.  In the case of processing queues, workers might be started on demand.  In this scenario it would be normal that the number of workers changes all the time, but it might be interesting to see how long a higher number of workers was active. Using a tool like SPM for Node.js lets one track the number of workers. When picking a monitoring solution or developing your own monitoring for Node.js, make sure it is capable of filtering by hostname and worker ID.  Keep in mind Node.js workers can have a very short lifetime that traditional monitoring tools may not be able to handle well.
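A crude way to sanity-check the worker count on a single host is simply counting the matching processes (server.js below is a placeholder for the app’s entry point); SPM tracks this number over time and per host for you:

# Count the Node.js processes started from server.js (master + workers)
pgrep -f "node server.js" | wc -l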


Worker Count

For example, to compare event loop latency in different Node.js sub-processes, we need to be able to select workers we want to compare:


Event Loop Latency for each Worker

Web Frameworks – There is a steadily growing number of frameworks for building web services with Node.js.  The most popular are Express, Hapi.js, Restify, Mean.io, and Meteor, among many others. When doing HTTP monitoring, here are some of the key metrics to pay attention to:

  • Response time (http/https)
  • Request rate
  • Error rates (total, error categories)
  • Request/Response content size
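For a quick manual spot check of some of these metrics (status code, response time, response size) against a local Node.js service, curl’s timing variables are handy; port 3000 is just an assumption:

# One request: HTTP status, total response time, and response size
curl -s -o /dev/null -w "HTTP %{http_code}  %{time_total}s  %{size_download} bytes\n" http://localhost:3000/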


HTTP/HTTPS Metrics Overview

Of course, Node.js apps don’t run in a vacuum.  They connect to other services, other types of applications, caches, data stores, etc.  As such, while it’s important to know what the key Node.js metrics are, monitoring Node.js alone, or separately from other parts of the infrastructure, is not the best practice.  If there is one piece of advice I can give to anyone looking into (Node.js) monitoring it is this: when you buy a monitoring solution — or if you are building it for your own use — make sure you end up with a solution that is capable of showing you the big picture.  For example, Node.js is often used with Elasticsearch (see the Top 10 Elasticsearch Metrics to Watch post), Redis, etc.  Seeing metrics for all the systems that surround Node.js apps is precious.  Here is just a small example of a dashboard showing a few Node.js and Elasticsearch metrics together.


Combined Dashboard of Node.js and Elasticsearch Metrics

So, those are our top Node.js metrics — what are YOUR top 10 metrics? We’d love to know so we can compare and contrast them with ours in a future post.  Please leave a comment, or send them to us via email or hit us on Twitter: @sematext.

And…if you’d like try SPM to monitor Node.js yourself, check out a Free 30-day trial by registering here.  There’s no commitment and no credit card required. Small startups, startups with no or very little outside funding, non-profit and educational institutions get special pricing – just get in touch with us.

[Note: this post originally appeared on O’Reilly Radar]

Death to APM and Logging Silos

Sematext has combined the power of SPM and Logsene in a single pane of glass – a unified view into all the key bits of operational intelligence every DevOps engineer needs: server and application performance metrics, logs, events, anomalies, alerts, ChatOps integrations, etc.  In other words, the whole is greater than the sum of its parts.

Metrics + Logs Correlation using SPM and Logsene Together in One UI

This video demonstrates how the SPM + Logsene combination solves the problems of having too much data to manage yourself and the disconnect when metrics and logs are siloed.  We address two of the most common problems — and their solutions — below the video.

Problem 1 – Big Data, Big Burden: Servers, Containers, Apps, and Devices spew out more and more data: more metrics, more logs, more events. Collecting and storing all this data is a challenge and is often not cheap, both in terms of the time invested in building and maintaining large-scale data collection, storage, and retrieval systems, and in terms of the infrastructure needed to run them.

Solution: Focus on your organization’s core business, your core strength, and outsource needlessly painful or expensive parts to those who specialize in them.  We already outsource all the time, except we don’t call it “outsourcing”: we buy food, we don’t grow or raise it.  We buy cars and don’t build them.  Most of us don’t buy physical servers any more.  Why?  Because others do that better, faster, cheaper.

Problem 2 – Metrics vs. Log Silos: Collecting and visualizing performance metrics and getting alerts when things go awry is great, but performance charts can tell us only so much.  Code instrumentation, like SPM’s Transaction Tracing, goes deeper and provides more insight, but still doesn’t tell us the whole story.  Similarly, collecting logs and being able to search them is very valuable.  Unfortunately, oftentimes APM and log management solutions live in separate silos that don’t really talk to each other.

Solution: Don’t waste your time jumping between multiple disconnected solutions, be they open-source or commercial.  Time is the most precious thing each of us has, and our time as DevOps engineers is very expensive.  Use a tool or service that gives you access to as many bits of operational information as possible.  Not only is this more efficient, and thus cheaper, it’s also much more pleasant than jumping between browser and terminal, top, vmstat, dstat, less, grep, etc., which is needlessly manual and gets boring.

Troubleshooting Doesn’t Need To Take Over Your Life

Troubleshooting production performance issues, dealing with APM alerts and even looking at logs (don’t even think about grepping!) isn’t that hard or time consuming.  Well, as long as you have the right tools, that is.


Cloud & On Premises Deployments

Unlike most monitoring and logging solutions, Sematext offers both Cloud and On Premises deployments for SPM and Logsene.  We’re happy to discuss package pricing if you’d like to combine both products.

Got ideas how we could make metrics and logs correlation more useful for you?  Let us know via comments, email or @sematext.

Not using SPM and/or Logsene yet? Check out the free 30-day trial by registering here (ping us if you’re a startup, a non-profit, or educational institution – we’ve got special pricing for you!).  There’s no commitment and no credit card required.

Introducing Top Database Operations

If you run Elasticsearch, Solr, or any backend you communicate with using SQL (via JDBC), like SparkSQL, Apache Cassandra (CQL), Apache Impala, Apache Drill, MySQL, PostgreSQL, etc., you’ll like what we’ve just added to SPM.  We call it Database Operations and in SPM you can find it in the new Database report:

If you didn’t watch the video, here’s what Database Operations gives you:

  • Top 5 operation types across all your data stores or filtered to a specific data store type
  • Top 5 operation types by speed, throughput, or simply their volume
  • Time-series reports for volume, throughput, and latency broken down by operation type
  • Ability to view all collected operations, not just the slowest ones, filter by database type or by operation type, sorted by average or total duration, or throughput
  • Sparklines that show last 5 minute values and trends
  • Top 10 slowest individual operations and drill-in details

  • Integration with Transaction Tracing, so you can correlate slow data store operations with the actual transaction/request that triggered them

Important:

  • To get this information add SPM agent to the application that is talking to a data store (e.g. Solr or Elasticsearch or MySQL or …). This is because the SPM agent captures operations at that client layer, not in the server itself.
  • To start capturing this information enable Transaction Tracing in your SPM agents

This, including Distributed Transaction Tracing, works for all Java applications.


Don’t forget – when you enable Database Operations you will also automatically get Transaction Tracing, as well as the cool AppMaps – enjoy! 🙂

Got ideas how we could make Database Operations better and more useful to you?  Let us know via comments, email or @sematext.

Grab a free 30-day SPM trial by registering here (ping us if you’re a startup, a non-profit, or educational institution – we’ve got special pricing for you!).  There’s no commitment and no credit card required.

Introducing Akka Monitoring

Akka is a toolkit and runtime for building highly concurrent, distributed and resilient message-driven applications on the JVM. It’s a part of Scala’s standard distribution for the implementation of the “actor model”.

How Akka Works

Messages between Actors are exchanged via Mailbox queues, the Dispatcher provides various concurrency models, and Routers manage the message flow between Actors. That’s quite a lot Akka is doing for developers!

But how does one find bottlenecks in distributed Akka applications? Well, many Akka users already use the great Kamon Open-Source Monitoring Tool, which makes it easy to collect Akka metrics.  However — and this is important! — predefined visualizations, dashboards, anomaly detection, alerts and role-based access controls for the DevOps team are out of scope for Kamon, which is focused on metrics collection only.  To overcome this challenge, Kamon’s design makes it possible to integrate Kamon with other monitoring tools.

Needless to say, Sematext has embraced this philosophy and contributed the Kamon backend to SPM.  This gives Akka users the option to use detailed Metrics from Kamon along with the visualization, alerting, anomaly detection, and team collaboration functionalities offered by SPM.

The latest release of Kamon 0.5.x includes the kamon-spm module and was announced on August 17th, 2015 on the Kamon blog.  Here’s an excerpt:

Pavel Zalunin from Sematext contributed the new kamon-spm module, which as you might guess allows you to push metrics data to the Sematext Performance Monitor platform. This contribution is particularly special to us, given the fact that this is the first time that a commercial entity in the performance monitoring sector takes the first step to integrate with Kamon, and they did it so cleanly that we didn’t even have to ask any changes to the PR, it was just perfect. We sincerely hope that more companies follow the steps of Sematext in this matter.

Now let’s take a look at the result of this integration work:

  • Metrics pushed to SPM are displayed in predefined reports, including:
    • An overview of all key Akka metrics
    • Metrics for Actors, Dispatchers and Routers
    • Common metrics for CPU, Memory, Network, I/O,  JVM and Garbage Collection
  • Each chart has the “Action” menu to:
    • Define criteria for anomaly detection and alerts
    • Create scheduled email reports
    • Securely share charts with read-only links
    • Embed charts into custom dashboards
  • A single SPM App can take metrics from multiple hosts to monitor a whole cluster; filters by Host, Actor, Dispatcher, and Router make it easy to drill down to the relevant piece of information.
  • All other SPM features are available for Akka users, too.  For example:


Akka Metrics Overview

Actor Metrics

Actors send and receive messages, therefore the key metrics for Actors are:

  • Time in Mailbox
    Messages are waiting to be processed in the Mailbox – high Time in Mailbox values indicate potential delays in processing.
  • Processing Time
    This is the time Actors need to process the received messages – use this to discover slow Actors
  • Mailbox Size
    Large Mailbox Size could indicate pending operations, e.g. when it is constantly growing.

Each of the above metrics is presented in aggregate for all Actors, but one can also use SPM’s filtering feature to view all Actors’ metrics separately, or select one or more specific Actors and visualize only their metrics.  Filtering by Host is also possible, as shown below.


Akka Actors

Dispatcher Metrics

In Akka a Dispatcher is what makes Actors ‘tick’. Each Actor is associated with a particular Dispatcher (the default one is used if no explicit Dispatcher is set). Each Dispatcher is associated with a particular Executor – Thread Pool or Fork Join Pool. The SPM Dispatcher report shows information about Executors:

  • Fork Join Pool
  • Thread Pool Executor

All metrics can be filtered by Host and Dispatcher.


Akka Dispatchers

Router Metrics

Routers can be used to efficiently route messages to destination Actors, called Routees.

  • Routing Time – Time to route a message to the selected destination
  • Time In Mailbox – Time spent in the routee’s mailbox.
  • Processing Time – Time spent by the routee actor processing routed messages
  • Errors Count – Number of errors encountered while routees process messages

For all these metrics, lower values are better, of course.


Akka Routers

You can set Alerts and enable Anomaly Detection for any Akka or OS metrics you see in SPM and you can create custom Dashboards with any combination of charts, whether from your Akka apps or other apps monitored by SPM.

We hope you like this new addition to SPM.  Got ideas how we could make it more useful for you?  Let us know via comments, email, or @sematext.

Not using SPM yet? Check out the free 30-day SPM trial by registering here (ping us if you’re a startup, a non-profit, or educational institution – we’ve got special pricing for you!).  There’s no commitment and no credit card required.  SPM monitors a ton of applications, like Elasticsearch, Solr, Cassandra, Hadoop, Spark, Node.js (open-source), Docker (get the open-source Docker image), CoreOS, RancherOS and more.