Observability Recipes

What Is Observability?

Observability is the ability to derive a valid conclusion about what is currently happening in a system and why it is happening.

Guiding Principles for Observability 

  1. The context and sequential flow of each end-to-end request is most important. We need to be able to see what is having an issue, which other parts are (or might be) affected, and what the issues have in common when things go wrong.
  2. We must be able to cut the data in many ways and correlate the different aspects of a request (e.g. the ability to filter by each user, their session, each server node, and any of these combined with the other attributes); a sketch of attaching such attributes to telemetry follows this list.
  3. Use questions to drive the features required for observability instead of relying on what we can already see.
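
As a minimal sketch of the second principle, the following TypeScript uses the OpenTelemetry JavaScript API to attach user, session and node attributes to a span, so the same request can later be filtered by any combination of them. The service name, attribute keys and checkout example are illustrative assumptions rather than part of any particular system.

import { trace, SpanStatusCode } from '@opentelemetry/api';
import * as os from 'os';

const tracer = trace.getTracer('checkout-service'); // assumed service name

// Hypothetical request handler: every span carries the attributes we want
// to slice by later (user, session, server node).
export async function handleCheckout(userId: string, sessionId: string): Promise<void> {
  await tracer.startActiveSpan('checkout', async (span) => {
    span.setAttribute('app.user_id', userId);
    span.setAttribute('app.session_id', sessionId);
    span.setAttribute('host.name', os.hostname());
    try {
      // ... business logic goes here ...
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}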

Observability Components

The main observability components, and what each of them means, are listed below.

Metrics

Metrics are numeric values that help evaluate a service’s overall behavior over time. They comprise a set of data points that can be used to derive a system’s performance.

Typical examples are:

  • uptime
  • response time
  • requests per second
  • CPU/RAM utilisation

Events

An event is a collection of data points about what it took to complete a unit of work. Events are records of selected significant points in that work, with metadata to provide context.

Typical examples are:

  • change of a workflow status
  • batch job completion

Logs

Logs are important for troubleshooting and trying to understand a problem. They provide detailed data and context so one can re-create and diagnose a problem.

Typical examples are:

  1. application logs
  2. server logs
  3. error logs
  4. debug logs

Traces

Traces are important for showing the step-by-step journey of a request or action as it moves through the system. They give specific insight into the flow and help one identify errors and find bottlenecks so they can be optimised and rectified.

Visualisation

Data needs to be presented in a visual and easy-to-comprehend way that allows it to be correlated, so that connections can be derived from the different data points and events happening in the system. This provides context that is otherwise not easily identifiable by looking at individual metrics alone.
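
To make these components concrete, here is a short sketch in TypeScript that records a metric and emits a structured log event for the same unit of work, using the OpenTelemetry JavaScript API for the metric instruments. The service name, instrument names and attributes are assumptions made only for illustration.

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('order-service'); // assumed service name

// Metric instruments: numeric values evaluated over time.
const requestCounter = meter.createCounter('http.server.requests', {
  description: 'Number of HTTP requests received',
});
const durationHistogram = meter.createHistogram('http.server.duration', {
  unit: 'ms',
  description: 'Time taken to serve each request',
});

export function recordRequest(route: string, statusCode: number, durationMs: number): void {
  requestCounter.add(1, { route, status_code: statusCode });
  durationHistogram.record(durationMs, { route });

  // Event/log: a structured record of a significant point, with context.
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    event: 'http.request.completed',
    route,
    status_code: statusCode,
    duration_ms: durationMs,
  }));
}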

Observability Recipes

We break the observability components into recipes, starting with the questions we want to answer. We can then list out the data points required to answer them.

Using this approach, we can easily map out what the gaps are for an existing system, or use it as an implementation pattern for a new system.

An example is as follows: 

For each recipe, we list the typical questions to ask and what is needed from each component (metrics, events, logs, traces and visualisation).

System Health

Typical questions to ask:

  • Are all components of my system up and running?
  • Are my servers reaching max capacity?

Metrics:

  • Resource details
    • total/consumed/available CPU/RAM/storage
  • Network statistics
    • latency
    • throughput
    • packet loss

Events: service start/stop status
Logs: N/A
Traces: N/A
Visualisation: systems health dashboard
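
A hedged sketch of how the system health metrics above might be collected on a Node.js node, using the OpenTelemetry metrics API; the instrument names, units and attributes are illustrative assumptions.

import { metrics } from '@opentelemetry/api';
import * as os from 'os';

const meter = metrics.getMeter('system-health'); // assumed meter name

// Available memory on this node, observed each time metrics are collected.
const memoryGauge = meter.createObservableGauge('system.memory.available', {
  unit: 'By',
  description: 'Available memory on this node',
});
memoryGauge.addCallback((result) => {
  result.observe(os.freemem(), { 'host.name': os.hostname() });
});

// 1-minute CPU load average, a rough "are we reaching max capacity?" signal.
const loadGauge = meter.createObservableGauge('system.cpu.load_1m', {
  description: '1-minute CPU load average',
});
loadGauge.addCallback((result) => {
  result.observe(os.loadavg()[0], { 'host.name': os.hostname() });
});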

Application performance

Typical questions to ask:

  • How fast/slow are my application services running?
  • Why has system performance degraded over the past few months?

Metrics:

  • response time (min/max/avg)
  • response payload size (min/max/avg)

Events/Logs: events with the metrics as payload, or relevant entries in the logs
Traces: N/A
Visualisation: core components/service list with the top 10 slow transactions
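
One possible way to capture the response time and payload size metrics for this recipe is an HTTP middleware; the sketch below assumes Express and the OpenTelemetry metrics API, and the instrument names are again illustrative.

import { Request, Response, NextFunction } from 'express';
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('app-performance'); // assumed meter name
const responseTime = meter.createHistogram('http.server.duration', {
  unit: 'ms',
  description: 'Server-side response time per route',
});
const responseSize = meter.createHistogram('http.server.response.size', {
  unit: 'By',
  description: 'Response payload size per route',
});

export function performanceMiddleware(req: Request, res: Response, next: NextFunction): void {
  const start = Date.now();
  res.on('finish', () => {
    const attrs = { route: req.path, method: req.method, status_code: res.statusCode };
    responseTime.record(Date.now() - start, attrs);
    responseSize.record(Number(res.getHeader('content-length') ?? 0), attrs);
  });
  next();
}

// Registered once on the application, e.g. app.use(performanceMiddleware);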

User experience

Typical questions to ask:

  • What are the browsers that my users are using?
  • What is their on-page performance?
  • How are the users using the system?

Metrics:

  • Application
    • end-user response time (min/max/avg)
    • page load complete time
  • HTML
    • DOM size
    • Ajax request response time (min/max/avg)
  • User profile
    • IP address
    • country
    • browser type
  • User journey
    • entry page
    • exit page
    • time spent on page
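
The user experience data points above are typically gathered in the browser. The TypeScript below is a minimal sketch that reads the Navigation Timing API after page load and sends a beacon to a hypothetical /rum endpoint; the endpoint and payload shape are assumptions.

// Runs in the browser after the page has fully loaded.
window.addEventListener('load', () => {
  const nav = performance.getEntriesByType('navigation')[0] as PerformanceNavigationTiming | undefined;
  const payload = {
    page: location.pathname,                              // entry/exit pages are derived from a series of these
    pageLoadCompleteMs: nav ? nav.loadEventEnd - nav.startTime : null,
    domSize: document.getElementsByTagName('*').length,   // rough DOM size
    browser: navigator.userAgent,                         // browser type
  };
  navigator.sendBeacon('/rum', JSON.stringify(payload));  // hypothetical collection endpoint
});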

Exception management

Typical questions to ask:

  • Are there any errors in my application services?
  • Is a functionality/page broken or not working?

Metrics:

  • server requests
  • HTML/DOM
    • page views with JavaScript errors

Events:

  • error events with the following:
    • standardised error code
    • date/time of error
    • unique user/session id

Logs:

  • standardised error code
  • date/time of error
  • unique user/session id
  • user-friendly error message
  • stack trace

Traces: detailed object call graph of a request

Visualisation: a dashboard that provides:

  • summary view of all error/exception stats
  • list view of error events
  • ability to drill down to each call for troubleshooting
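
As one hedged illustration of this recipe, the sketch below is an Express error handler that emits a standardised error event with the fields listed above and attaches the exception to the current trace via the OpenTelemetry API; the error code convention and the x-session-id header are assumptions.

import { Request, Response, NextFunction } from 'express';
import { trace } from '@opentelemetry/api';

// Hypothetical standardised error event matching the fields listed above.
interface ErrorEvent {
  errorCode: string;
  timestamp: string;
  sessionId?: string;
  userMessage: string;
  stackTrace?: string;
}

export function errorHandler(err: Error, req: Request, res: Response, _next: NextFunction): void {
  const event: ErrorEvent = {
    errorCode: 'ERR_UNHANDLED',                       // standardised error code (assumed convention)
    timestamp: new Date().toISOString(),              // date/time of error
    sessionId: req.header('x-session-id'),            // unique user/session id
    userMessage: 'Something went wrong. Please try again later.', // user-friendly error message
    stackTrace: err.stack,                            // stack trace for the logs
  };
  console.error(JSON.stringify(event));               // error event / log entry
  trace.getActiveSpan()?.recordException(err);        // tie the error to the current trace
  res.status(500).json({ code: event.errorCode, message: event.userMessage });
}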

Besides the above recipes, there are a couple that should be considered as well:

  1. Release management
    • Why did the release of feature “x” fail?
    • What went wrong during the release?
    • Why did a release take so long to deploy into production?
  2. Security monitoring (a sketch follows this list)
    • Are there any security breaches?
    • Is there any abnormal user behavior?
    • Are there any new vulnerabilities in my current system?
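
As a brief sketch of the security monitoring recipe mentioned above, the TypeScript below counts failed logins and emits a structured audit event that an "abnormal user behavior" alert could be built on; the instrument name, attributes and event shape are illustrative assumptions.

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('security-monitoring'); // assumed meter name
const failedLogins = meter.createCounter('auth.login.failures', {
  description: 'Failed login attempts, used to spot abnormal user behavior',
});

export function recordFailedLogin(userId: string, sourceIp: string): void {
  failedLogins.add(1, { 'user.id': userId, 'source.ip': sourceIp });

  // Structured audit event that a dashboard or alert rule can correlate later.
  console.warn(JSON.stringify({
    timestamp: new Date().toISOString(),
    event: 'auth.login.failed',
    userId,
    sourceIp,
  }));
}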
