- 
                Notifications
    
You must be signed in to change notification settings  - Fork 315
 
Add long running traces to flare report, allow flare files to be downloaded with JMX #9874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
          
     Open
      
      
            deejgregor
  wants to merge
  9
  commits into
  DataDog:master
  
    
      
        
          
  
    
      Choose a base branch
      
     
    
      
        
      
      
        
          
          
        
        
          
            
              
              
              
  
           
        
        
          
            
              
              
           
        
       
     
  
        
          
            
          
            
          
        
       
    
      
from
deejgregor:pending-traces-jmx-dump
  
      
      
   
  
    
  
  
  
 
  
      
    base: master
Could not load branches
            
              
  
    Branch not found: {{ refName }}
  
            
                
      Loading
              
            Could not load tags
            
            
              Nothing to show
            
              
  
            
                
      Loading
              
            Are you sure you want to change the base?
            Some commits from the old base branch may be removed from the timeline,
            and old review comments may become outdated.
          
          Conversation
  
    
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      Learn more about bidirectional Unicode characters
    
  
  
    
    Synchronized accesses to traceArray in LongRunningTracesTracker since the flare reporter can now access the array. This shouldn't be a concern for blocking because addTrace and flushAndCompact are the existing calls from PendingTraceBuffer's run() loop and getTracesAsJson is called by the reporter thread and will complete fairly quickly.
…ature This allows dumping long running traces when not connected to a Datadog Agent using the new JMX flare feature. A warning message will be logged in this case to indicate that long running traces will not be sent upstream but are available in a flare. Previously the long running traces buffer would always be empty, even though the feature was enabled with dd.trace.experimental.long-running.enabled=true. This led to a good amount of confusion when I was initially developing a feature to dump long running traces without a local Datadog Agent running.
The JMX telemetry feature is controlled by dd.telemetry.jmx.enabled
and is disabled by default. It enables JMXFetch telemetry (if
JMXFetch is enabled, which it is byd default) and also enables a
new tracer flare MBean at datadog.flare:type=TracerFlare. This new
MBean exposes three operations:
java.lang.String listFlareFiles()
- Returns a list of sources and files available from each source.
java.lang.String getFlareFile(java.lang.String p1,java.lang.String p2)
- Returns a single file from a specific reporter (or flare source).
- If the file ends in ".txt", it is returned as-is, otherwise it is
  base64 encoded.
java.lang.String generateFullFlareZip()
- Returns a full flare dump, base64 encoded.
An easy way to enable this for testing is to add these arguments:
    -Ddd.telemetry.jmx.enabled=true
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.host=127.0.0.1
    -Dcom.sun.management.jmxremote.port=9010
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false
To test, you can use jmxterm (https://github.com/jiaqi/jmxterm) like
this:
echo "run -b datadog.flare:type=TracerFlare listFlareFiles" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent
echo "run -b datadog.flare:type=TracerFlare getFlareFile datadog.trace.agent.core.LongRunningTracesTracker long_running_traces.txt" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
    jq .
echo "run -b datadog.flare:type=TracerFlare generateFullFlareZip" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
    base64 -d > /tmp/flare.zip && \
    unzip -v /tmp/flare.zip
    …ng priority This likely isn't an important metric to track, but I noticed these traces were the only ones not reflected in existing LongRunningTraces metrics, so I thought it might be good to add for completeness.
| 
           Jira card for context: APMS-17557  | 
    
Move config option to GeneralConfig (TracerConfig was the wrong place).
Add integration test for flare JMX support.
Dump at most 50 long running traces
  
    Sign up for free
    to join this conversation on GitHub.
    Already have an account?
    Sign in to comment
  
      
  Add this suggestion to a batch that can be applied as a single commit.
  This suggestion is invalid because no changes were made to the code.
  Suggestions cannot be applied while the pull request is closed.
  Suggestions cannot be applied while viewing a subset of changes.
  Only one suggestion per line can be applied in a batch.
  Add this suggestion to a batch that can be applied as a single commit.
  Applying suggestions on deleted lines is not supported.
  You must change the existing code in this line in order to create a valid suggestion.
  Outdated suggestions cannot be applied.
  This suggestion has been applied or marked resolved.
  Suggestions cannot be applied from pending reviews.
  Suggestions cannot be applied on multi-line comments.
  Suggestions cannot be applied while the pull request is queued to merge.
  Suggestion cannot be applied right now. Please check back later.
  
    
  
    
What Does This Do
This does two main things:
Examples:
Motivation
While adding custom instrumentation to a complex, asynchronous application we found it was challenging to validate if all spans were
end()ed during tests.dd.trace.debug=trueanddd.trace.experimental.long-running.enabled=truecould be used with some post-processing of debug logs, however this didn't work for our needs because the application breaks with that level of logging. Whendd.trace.experimental.long-running.enabled=trueis used, the long running traces are sent to Datadog's backend, however they are not searchable until they are finished, so we didn't have a good way to find them. This change gives us two ways to access the long running traces list with either a flare report or via JMX.I initially started by adding JMX MBeans to retrieve just the pending and long running traces and counters. Once I added the long running traces to the flare report to parity with pending traces, I realized that a more generic mechanism to allow getting flare details over JMX might be useful. After adding a TracerFlare MBean, this seemed like a far more valuable route and I removed the code I had added for pending/long running trace MBeans.
Additional Notes
Outstanding items
dd.telemetry.jmx.enabled=true.MAX_DUMPED_TRACES = 50).This PR has a number of commits and I suggest reviewing commit-by-commit, paying special attention to the notes in bold below:
synchronizedto a few methods (see commit comment for details).features.supportsLongRunning()is false, the traces are kept in theTRACKEDstate, compared to theNOT_TRACKEDstate previously.add*methods as-is, but this could be simplified by refactoring the add* methods into Reporter instances (with a new signature that passes a few more arguments toaddReportToFlare). I think this refactoring would be a good change to make--let me know and I'll happily do that. I also considered not making the zip file an intermediary, and if you like, I could look at what that change might be, as well.And three newer fixup commits:
Contributor Checklist
type:and (comp:orinst:) labels in addition to any useful labelsclose,fixor any linking keywords when referencing an issue.Use
solvesinstead, and assign the PR milestone to the issueJira ticket: [PROJ-IDENT]