Keystone Real Time Trace Workshop March 2013
Agenda
• Multicore debug and trace features (15 minutes)
• Workshops (120 minutes)
– Setup overview
– WS1: Getting started with DSP trace
– WS2: Customize DSP trace for data tracing
– WS3: Hotspot analysis with function profiling, stalls, and cache analysis
– WS4: Getting started with non-intrusive system trace (STM) SoC profiling
– WS5: Customize SoC profiling for DDR and MSMC bandwidth and latency analysis
– WS6: DDR bandwidth analysis and latency by interfacing embedded APIs (cToolsLib)
– WS7: Command line trace decode interface for offline decode
– WS8: Getting started with Cortex A program execution trace and function profiling
Multicore debug and trace features
Keystone debug and trace – key goals
• Efficiency: multicore debug IP within the SoC
• Consistency: across all KeyStone devices
• Product life cycle: development and deployment
• Ecosystem enablement: 3P framework and low-cost tools
Keystone debug & trace strategy
Multicore Interactions
• Debugging cores independently or as a group
• Synchronization and execution correlation
• Runtime visibility in data flows spanning multiple cores
Application Optimization
• CPU loading and load balancing
• Actual application execution sequence
• Cache and CPU stalls impact
Optimize Data Flow in the System
• Bus transactions visibility
• Interface throughput and bottlenecks characterization
Keystone-1 (debug & trace view)
[Block diagram: Keystone-1 debug and trace view. CTools Debug SS: ICEpick, IEEE 1149.1 JTAG, STM, ETB, Trace Pin Export, STM Pin Export. C66x DSP SS: JTAG, Trace, AET, ETB, SW Msg. Switch Fabric with CP_Tracer instances (DDR, L2 Memory, Peripherals…). ARM SS (Cortex A8): JTAG, SW Msg, XTI, HW BP/WP, CTM, ETM Trace, PMU, ETB, Trace Funnel, Replicator, TPIU.]
C66x DSP: Shannon 8 cores, Nyquist 4 cores, Appleton 4 cores
Cortex A8: Shannon NA, Nyquist NA, Appleton 1 core
Keystone-2 (debug & trace view)
[Block diagram: Keystone-2 debug and trace view. CTools Debug SS: ICEpick, IEEE 1149.1 JTAG, STM, TBR, Trace Pin Export, STM Pin Export. C66x DSP SS: JTAG, Trace, AET, ETB, SW Msg. Switch Fabric with CP_Tracer instances (DDR, L2 Memory, Peripherals…). Tetris SS (Cortex A15): JTAG, SW Msg, XTI, HW BP/WP, CTM, PTM Trace, ARM STM, PMU, Trace Funnel, Replicator, TPIU, TBR.]
C66x DSP: Kepler 8 cores
Cortex A15: Kepler 4 cores
Debug capabilities summary
Feature                                      Shannon   Nyquist   Appleton         Kepler
JTAG debug                                   √         √         √                √
CoreSight ARM Debug for Cortex               --        --        √                √
DSP AET – HWBP, WP, & sequencer              √         √         √                √
DSP Trace – PC, data, & events               √         √         √                √
Cortex HWBP & WP                             --        --        √                √
Cortex Performance Measurement Units (PMU)   --        --        √                √
Cortex Processor Trace – PC & timing         --        --        √                √
Cortex Processor Trace – Data                --        --        √                --
STM SW messages                              √         √         √                √
STM CP Tracers                               17        16        18               32
On-Chip Trace Buffer – DSP Trace             4KB       4KB       4KB              4KB
On-Chip Trace Buffer – STM                   32KB      32KB      32KB             32KB (TBR)
On-Chip Trace Buffer – Cortex Trace          --        --        32KB             16KB (TBR)
Trace Export                                 20 pins   20 pins   32 pins (TPIU)   32 pins (TPIU)
Embedded debug and analysis
• CtoolsLib – Enabling embedded debug, trace setup, and analysis use case
• Easy access to debug capabilities via simple C APIs
• Very low latency and small footprint (order of few KBs)
• Easy OS integration
• Integrated with MCSDK
• Easy import and data visualization via CCS
Field Deployed Debug and Trace
API                                  Shannon   Nyquist   Appleton   Kepler
AETLib                               √         √         √          √
ETBLib (with DMA draining support)   √         √         √          √
DSPTraceLib                          √         √         √          √
ETMLib                               --        --        √          √
STMLib                               √         √         √          √
STM Linux driver                     --        --        √          √
CPTLib                               √         √         √          √
Workshops
MCSDK Image processing demo overview
[Figure: processing flow. Input image (bitmap) -> Core0 (master core) processing: file read, RGB extract and slicing -> slice 0 (Core 0), slice 1 (Core 1), ... slice 3 (Core 3) -> Core0-3 (slave core) processing of each RGB slice: RGB to Y (luma), IMGLIB Sobel filter (image gradient), IMGLIB threshold (image edge) -> Core0 (master core): combine edge slices, create bitmap image, file write -> output image (bitmap).]
• This application implements an image processing system on a simple multicore framework. It runs TI image processing kernels (IMGLIB, a.k.a. imagelib) on multiple cores to process (e.g., edge detection) an input image sent from the host PC.
• For more details on MCSDK Image processing demo, please refer to: http://processors.wiki.ti.com/index.php/MCSDK_Image_Processing_Demonstration_Guide
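The per-slice flow of the demo (RGB to Y, Sobel filter, threshold, then combine) can be sketched in a few lines of Python. This is only an illustration of the data flow: the function names are hypothetical, the luma weights are the common BT.601 values, the threshold is a plain binary one, and none of this is the IMGLIB implementation. Each slice is filtered on its own, so rows at slice boundaries are left at zero in this sketch.

```python
# Illustrative sketch only: pure-Python stand-ins for the demo's per-slice stages.
# All function names are hypothetical; the real demo uses TI IMGLIB kernels.

def rgb_to_y(rgb_rows):
    """Convert rows of (r, g, b) tuples to 8-bit luma (BT.601 weights, assumed)."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row]
            for row in rgb_rows]

def sobel_3x3(y):
    """Gradient magnitude |Gx| + |Gy| clamped to 255; border pixels stay 0."""
    h, w = len(y), len(y[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = (y[i - 1][j + 1] + 2 * y[i][j + 1] + y[i + 1][j + 1]
                  - y[i - 1][j - 1] - 2 * y[i][j - 1] - y[i + 1][j - 1])
            gy = (y[i + 1][j - 1] + 2 * y[i + 1][j] + y[i + 1][j + 1]
                  - y[i - 1][j - 1] - 2 * y[i - 1][j] - y[i - 1][j + 1])
            out[i][j] = min(255, abs(gx) + abs(gy))
    return out

def threshold(img, level):
    """Simple binary threshold (not the IMGLIB kernel's exact behavior)."""
    return [[255 if p > level else 0 for p in row] for row in img]

def split_rows(rows, n_slices):
    """Split the image into contiguous row slices, one per core."""
    step = max(1, (len(rows) + n_slices - 1) // n_slices)
    return [rows[k:k + step] for k in range(0, len(rows), step)]

def process_image(rgb_rows, n_cores=4, level=128):
    """Master-core view: slice, process each slice, combine the edge slices."""
    edges = []
    for rgb_slice in split_rows(rgb_rows, n_cores):  # one slice per slave core
        edges.extend(threshold(sobel_3x3(rgb_to_y(rgb_slice)), level))
    return edges
```

In the demo each slice runs on a different core; here the loop simply processes the slices one after another.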
Setup and Installation overview
• Hardware Setup
– XDS560v2 Pro Trace
– Nyquist (C6670) EVM
– Appleton (C6614) EVM (only required for ARM trace workshops)
– Ethernet cable connected between Nyquist EVM and host PC
• Software Installation
– Code Composer Studio v5.4
– BIOS-MCSDK v02.01.02.06 (+ patch 02.01.02.06P01) or newer
Setup and Installation overview
1. Download the following from links specified on previous page
i. CCS5.4.0.000xx_win32
ii. bios_mcsdk_02_01_02_06_setupwin32.exe (or newer)
iii. bios_mcsdk_02_01_02_06_patch01_setupwin32.exe (if using v2_01_02_06)
2. Install Code Composer Studio
3. Install BIOS MCSDK into c:\ti folder
4. Install BIOS MCSDK Patch (if required) into c:\ti folder
5. Start Code Composer Studio
6. Select a workspace when requested
7. Wait until CCS Add Discovered Products window comes up
8. Select OK
9. If warning pop-up, select OK
10. Say Yes to restarting CCS when requested
11. After CCS restarts, if requested to add other versions of NDK, select Cancel
12. Close TI Resource Explorer window
Steps to import and build the demo
1. From CCS main menu, select Project -> Import Existing CCS Eclipse Project
2. Browse to folder C:\ti\mcsdk_2_01_02_06\demos\image_processing\ipc\evmc6670l
3. Select OK
4. Select the following projects
i. Image_processing_evmc6670l_slave
ii. Image_processing_evmc6670l_master
iii. Image_processing_evmc6670l_total_bandwidth_master
5. Select Finish
6. Right-click each imported project in Project Explorer and select Build Project
Steps to run the demo (i)
• To set up the demo in static IP mode, SW9 position 2 should be OFF. Other DIP switch settings, from position 1 to position 4:
– SW3: OFF, ON, ON, ON
– SW4: ON, ON, ON, ON
– SW5: ON, ON, ON, ON
– SW6: OFF, ON, ON, ON
– SW9: ON, OFF, ON, ON
• Connect the Nyquist EVM to the host PC using an Ethernet cable.
• Connect the XDS560v2 Pro Trace to the EVM and to the PC (via USB). Power up the EVM and the XDS560v2.
• Change the host PC network settings to use static IP address 192.168.2.101.
– Go to “Control Panel” -> “Network and Sharing Center” -> “Change Adapter Settings”
– Right-click on “Local Area Network” and change the “Properties”
– If you prefer to use DHCP, the IP address is shown in the CCS console after the target is run.
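A quick sanity check for the static IP setup is to confirm that the host address and the EVM address (192.168.2.100, used later in the demo) fall in the same subnet. The sketch below uses Python's standard ipaddress module; the netmask 255.255.255.0 is an assumption, since the workshop does not state one.

```python
import ipaddress

HOST_IP = "192.168.2.101"          # static host address from the steps above
EVM_IP = "192.168.2.100"           # EVM address used in the demo web page step
ASSUMED_NETMASK = "255.255.255.0"  # assumption: not stated in the workshop

def same_subnet(ip_a, ip_b, netmask):
    """Return True if ip_b lies in the network defined by ip_a and netmask."""
    network = ipaddress.ip_network(f"{ip_a}/{netmask}", strict=False)
    return ipaddress.ip_address(ip_b) in network

if __name__ == "__main__":
    print("Host and EVM on same subnet:",
          same_subnet(HOST_IP, EVM_IP, ASSUMED_NETMASK))
```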
Steps to run the demo (ii)
• In CCS, set up a C6670 target configuration with the XDS560v2 Pro Trace USB connection
– Go to File -> New -> Target Configuration File
– Type file name as C6670_XDS560v2 and click Finish
– Now select Connection as Spectrum Digital XDSPRO USB Emulator
– Type C6670 in the Device field; device names will be filtered; select/check TMS320C6670
– Click on Target Configuration from Advanced Setup (RHS)
– Now select C66x_0 and include <CCS_INSTALL>\ccsv5\ccs_base\emulation\boards\evmc6670l\gel\evm6670l.gel from the “initialization script” box (RHS). Click on Save
– Select View -> Target Configurations to see a list of all configuration files
– Select the one you just created (C6670_XDS560v2.ccxml) under User Defined
– Launch the debug session by selecting Launch Selected Configuration in the context menu
• Group all four C66x cores into one single group
Steps to run the demo (iii)
• Connect the cores
• Load image_processing_evmc6670l_master.out (mcsdk_2_01_02_06\demos\image_processing\ipc\evmc6670l\master\no_instrumentation\Debug) on core 0.
• Load image_processing_evmc6670l_slave.out (mcsdk_2_01_02_06\demos\image_processing\ipc\evmc6670l\slave\no_instrumentation\Debug) on cores 1, 2, and 3.
• Run all the cores from the CCS debug view
Steps to run the demo (iv)
• Open a web browser and type 192.168.2.100 (the EVM’s IP address) in the address box. An interactive web page is displayed.
• Select Number of Cores as “Four”
• Browse and provide the path to a bitmap image evmc6678l_1920x1080_5_93MB.bmp (available at: mcsdk_2_01_02_06\demos\image_processing\images)
• Click on Process
Steps to run the demo (v)
Image processing demo output includes details such as processing time and the output image.
WS1 Getting started with DSP trace
WS1 – Tracing program execution
1. If the Image Processing Demo is not already running, complete steps i to iii of Steps to Run Demo as described in Workshop Setup
2. Select C66xx_0 in the debugger
3. In the menu select Tools -> Hardware Trace Analyzer -> PC Trace to start PC Trace
4. Click the button [icon not shown] to clear any saved/cached settings from the previous run
5. Set Trace Range = End at Address
6. Set End Address = convert_rgb_to_y
7. Select Start to open Trace Viewer
Screenshot callouts: the list of available analyses depends on the selected core(s) and their state; the dialog shows a description of the default configuration.
WS1 – Tracing program execution
8. Trace Viewer status shows that the buffer has already wrapped, but the data will only be shown when recording ends
9. In this case recording will end either when convert_rgb_to_y is executed or when C66xx_0 is halted
10. Ignore the warning in the view for now. Clock frequency will be obtained when data collection stops
11. Complete step iv of Steps to Run Demo (as described in Workshop Setup) to process the image
12. Wait for Trace Viewer to show all collected data
WS1 – Analyze trace result
13. Grab and drag column borders to resize as needed
14. Grab and drag column headers to reposition columns as required
15. Trace Viewer shows Program Addresses executed leading up to convert_rgb_to_y
Screenshot callouts: use the auto-fit button to fit all column widths; grab a column edge and drag to resize the column; grab a column header and drag to move the column.
WS1 – View source code
1. In Trace Viewer, click on the record before convert_rgb_to_y
2. From Trace Viewer right-click-context-menu select Trace Viewer -> View Source Code
3. The file mcip_core.c is open at line 116 showing source code corresponding to the program address in the selected record
4. Scroll down in Trace Viewer to the record containing convert_rgb_to_y and notice the function convert_rgb_to_y is highlighted in the source file
WS1 – Function execution graph
1. From Trace Viewer toolbar, select Analyze -> Function Execution Graph
2. Click on the + next to Function on y-axis to expand graph
3. Double-click on graph title to expand graph to full-screen
4. Click multiple times on the Zoom out button in the graph toolbar to see entire execution
WS1 – Function execution graph
5. Grab y-axis with mouse and drag to see more of the name of the functions
6. Place mouse just below the x-axis and select that last bit of the graph to zoom into selected region
7. (Optional) From Function Execution Graph toolbar select Display Properties
8. (Optional) In the properties view, click on State/Event Categories tab, uncheck Visibility of functions that are not of interest, select OK. This will fit more of the graph in view
WS1 – Function execution graph
9. Note the graph shows what function is executing and not function entry/exit
10. From the latter part of the graph, observe process_rgb making some UIA logging calls, then calling convert_rgb_to_y
11. Double-click on the graph title to collapse the full-screen view
12. Click anywhere in the graph to automatically scroll Trace Viewer to the same cycle position
13. Click anywhere in the Trace Viewer to scroll the graph to the same cycle position
14. Click the Enable/Disable Grouping button on the graph toolbar to disable grouping
15. Now click anywhere in the graph and note that Trace Viewer is no longer scrolled
Screenshot callouts: Function Name, Cycle count, Running Function, Sort Functions, Expand & Collapse, Enable/Disable Grouping.
WS1 – Function execution graph
16. Click the measurement marker button [icon not shown] in the Function Execution Graph toolbar, then click at the beginning of an instance of process_rgb in the graph. This inserts measurement marker 1 (X1)
17. Repeat step 16, but this time click at the end of the same instance of process_rgb. This inserts measurement marker 2 (X2)
18. Look at the top left corner of the graph to see the number of cycles between X1 and X2
19. While holding the Shift key, use the mouse to select and drag X2. Notice the change in the number at the top left corner of the graph
20. Double-click on X2 to remove it
21. From the context menu select Remove All Measurement Marks to remove the remaining markers (in this case only X1)
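The cycle count shown between markers X1 and X2 converts to elapsed time once the CPU clock frequency is known. A minimal sketch of that arithmetic follows; the 1 GHz clock and the marker positions are made-up example values, not figures from the workshop.

```python
def cycles_to_microseconds(x1_cycle, x2_cycle, cpu_hz):
    """Elapsed time in microseconds between two cycle positions."""
    return abs(x2_cycle - x1_cycle) * 1e6 / cpu_hz

if __name__ == "__main__":
    ASSUMED_CPU_HZ = 1_000_000_000  # assumption: 1 GHz core clock, for illustration
    x1, x2 = 1_000, 251_000         # hypothetical marker positions in cycles
    print(cycles_to_microseconds(x1, x2, ASSUMED_CPU_HZ), "us")
```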
WS1 – Program address graph
1. From Trace Viewer toolbar, select Analyze -> Program Address vs. Cycle
2. Click the graph zoom out button multiple times to see the entire range of program addresses executed
3. (Optional) While holding the ALT key, use the mouse to zoom into a selected region
4. (Optional) Use the zoom reset button on the graph toolbar to restore the original zoom
WS1: What did we learn?
• Can use DSP trace to get real-time tracing of program execution
• Analyses are available in Trace Viewer to process collected data
• Function Execution Graph provides a bird’s eye view of program execution
• Function Execution Graph can be used to measure the number of cycles between operations
• Program Address Graph shows what program addresses are executed
• Views have numerous features to help navigate the large volume of data that may be collected
WS2 Customize DSP trace for data tracing
WS2 – Tracing data access
1. Skip this step if continuing from WS1 or if the Image Processing Demo is already running. Otherwise, complete steps i to iii of Steps to Run Demo (as described in Workshop Setup)
2. Select C66xx_0 in debugger
3. Open PC Trace Analysis from Tools -> Hardware Trace Analyzer -> PC Trace
4. If PC Trace was already running (from WS1) then select Close PC Trace in the Resource already in Use! dialog that pops-up (only 1 trace analysis can run on a cpu at any time)
5. Note that configuration from previous run is restored
6. In the Hardware Trace Configuration dialog select Advanced Settings
WS2 – Tracing data access
7. In the Advanced Properties dialog, note that this analysis has three trace “jobs”: a receiver (in this case ETB) and two trigger jobs, one to start trace and the other to end trace (when the program address at convert_rgb_to_y is executed)
8. Select PC Trace in the left column
9. In the right column, expand the Properties tree down to the What to Trace properties and enable tracing of Write Data and Read Data
10. Select OK
11. Select Start in Hardware Trace Analysis Configuration
WS2 – Tracing data access
12. Complete step iv of Steps to Run Demo to process image
13. Wait for Trace Viewer to update with collected data
14. Note that Trace Viewer does not show Data Read and Data Write columns by default
15. In trace viewer toolbar click on Column Settings button
16. In Column Settings dialog, enable visibility of Read Data and Write Data
17. Select OK to exit the dialog
18. In Trace Viewer resize and reposition columns as required
19. Scroll through Trace Viewer to see what data was read/written
WS2 – Saving configuration
1. Click on the Analysis Properties button in the Trace Viewer toolbar
– This reopens the configuration dialog. Here properties can be modified and re-applied to the analysis
– We’ll not be modifying properties; instead we’ll save the current configuration for future reuse
2. Press the Save button at the bottom of the configuration dialog
3. In Save Configuration dialog enter My Data Trace for Configuration Name then Save
4. Press Cancel to exit Hardware Trace Analysis Configuration dialog
5. Close the Trace Viewer
6. Go to Tools -> Hardware Trace Analyzer -> User Configurations and note that My Data Trace is now available for reuse
WS2 – Sharing configuration
7. Create a c:\temp folder on your hard disk
8. Select Tools -> Hardware Trace Analyzer -> User Configurations -> My Data Trace
9. Click the Export Configuration button at the bottom of the configuration dialog
10. Browse to c:\temp folder, select Save
11. Click on Delete button at bottom of configuration dialog to delete this saved configuration
12. Go to Tools -> Hardware Trace Analyzer. Note that User Configurations no longer exists
13. Select Tools -> Hardware Trace Analyzer -> Import Configuration…
14. Browse to c:\temp, select File Name My Data Trace.zip, click Open
15. Go to Tools -> Hardware Trace Analyzer. Note that User Configurations now exists again with My Data Trace
WS2- What did we learn?
• Trace can be used to monitor what data addresses and values are accessed
• Can further customize trace configuration using Advanced Settings
• Configurations can be saved for reuse
• Saved configurations can be exported/imported
WS3 Hotspot analysis with function, stall and cache profiler
WS3 – Running function profiler
1. If Image Processing Demo is not running, complete steps i to iii of Steps to Run Demo
2. Select C66xx_0 in debugger
3. Open Function Profiler from Tools -> Hardware Trace Analyzer -> Function Profiling
4. Click the button [icon not shown] to reset to original settings
5. Change Transport/Receiver Type to Pro Trace with Buffer Size 1MB
WS3 – Running function profiler
6. Click Data Collection Settings to expand
7. Select Start and Stop at Address for Trace Range. Note: the Stop Address does not end the trace; it only pauses collection until the Start Address is encountered again
8. Set Start Address = IMG_sobel_3x3_8 and End Address = MultiProc_self
9. Select Start
10. Complete step iv of Steps to Run Demo
11. Wait for Demo to complete
12. Press Stop in Trace Viewer toolbar
13. Note that Trace Viewer and Exclusive Function Profiler are processing data
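The Start and Stop at Address behavior noted in step 7 (collection pauses at the stop address and resumes whenever the start address is seen again) can be modeled as a simple gate over a stream of program addresses. This is a conceptual sketch, not how the trace hardware is implemented; whether the stop address itself is recorded is an assumption (here it is not), and the addresses are made-up values.

```python
def collect_between(pc_stream, start_addr, stop_addr):
    """Yield program addresses seen while collection is enabled.

    Collection turns on at start_addr and off at stop_addr, and can turn
    on again later, mirroring the Start and Stop at Address trace range.
    """
    collecting = False
    for pc in pc_stream:
        if pc == start_addr:
            collecting = True
        elif pc == stop_addr:
            collecting = False   # assumption: the stop address is not recorded
        if collecting:
            yield pc

if __name__ == "__main__":
    # Hypothetical addresses: 10 stands in for the start symbol, 20 for the stop symbol.
    stream = [1, 2, 10, 11, 12, 20, 21, 10, 13, 20, 5]
    print(list(collect_between(stream, start_addr=10, stop_addr=20)))
```

Only the two windows that begin at the start address are kept, which is why limited trace buffer space is spent on the functions of interest.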
WS3 – Analyzing function profile results
1. Wait for Trace Viewer and Exclusive Function Profiler processing to complete
2. Resize column width of Exclusive Function Profiler view as needed
3. Click on CPU Cycle Total column header twice to sort data in descending order
4. Note that assembly routines are shown as unknown_<address of first symbol above>_<address of first symbol below – 1>_<first symbol above>. Explicit names can be provided via an XML file specified in preferences
5. Note that two functions, IMG_thr_le2min_8() and IMG_sobel_3x3_8(), take ~99% of the time
WS3 – Analyzing function profile results
6. Scroll to the right in the Exclusive Function Profiler table
7. Observe that ~92% (1.2M cycles) of the IMG_thr_le2min_8() time was a result of pipeline stalls
WS3 – Running stall profiler
1. Select Tools -> Hardware Trace Analyzer -> Function Profiling (C66xx_0) -> Close Session to close the currently running Function Profiler
2. Open Stall Profiler from Tools -> Hardware Trace Analyzer -> Stall Profiling
WS3 – Running stall profiler
3. Click the button [icon not shown] to reset to original settings
4. Change Transport/Receiver Type to Pro Trace with Buffer Size 1MB
5. Click on Advanced Settings to set up the Start/Stop condition (Start/Stop support will be added to the configuration dialog in the next release)
WS3 – Running stall profiler
6. Select Pipeline Stall Analysis trigger in left column of Advanced Properties dialog
7. Expand Properties tree in left column and change Actions to Start Trace and Location to IMG_sobel_3x3_8
8. Expand Global Category to see what events are collected by default
9. Click on [icon not shown] in the left margin to add another trigger (default name can be changed)
10. Select Trigger2 in the left column and change Actions to End Trace and Location to MultiProc_self. Click somewhere else in the property view to allow the symbol to be evaluated
WS3 – Running stall profiler
11. Select OK in Advanced Properties dialog
12. Select Start in Hardware Trace Analysis Configuration dialog
13. Complete step iv of Steps to Run Demo
14. Wait for Demo to complete
15. Press Stop in Trace Viewer toolbar
16. Wait for Trace Viewer and Stall Cycle Profiler processing to complete
WS3 – Analyzing stall profiler results
1. Observe that ~1M of the stall cycles for IMG_thr_le2min_8() are a result of L1D Read Misses
WS3 – Running cache analysis
1. Open Cache Analyzer from Tools -> Hardware Trace Analyzer -> Cache Analyzer
2. Close Stall Profiler when requested
3. Click the button [icon not shown] to reset to original settings
4. Change Transport/Receiver Type to Pro Trace with Buffer Size 1MB
5. Expand Data Collection Settings and select L1D Cache Miss Analysis
6. Click on Advanced Settings
WS3 – Running cache analysis
7. Select L1D Cache Miss Analysis trigger in left column
8. Expand Properties tree and change Actions to Start Trace and Location to IMG_sobel_3x3_8
9. Expand Global Category to see what events are collected by default
10. Click on [icon not shown] in the left margin to add another trigger
11. Select Trigger2 in the left column and change Actions to End Trace and Location to MultiProc_self
WS3 – Running cache analysis
12. Select OK in Advanced Properties dialog
13. Select Start in Hardware Trace Analysis Configuration dialog
14. Complete step iv of Steps to Run Demo
15. Wait for Demo to complete
16. Press Stop in Trace Viewer toolbar
17. Wait for Trace Viewer and Cache Event Profiler processing to complete
WS3 – Analyzing cache results
1. Observe that the 1.1M L1D Read Miss cycles of IMG_thr_le2min_8() result from 16320 cache misses
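Dividing the two figures above gives a rough average stall cost per miss. The sketch below only restates that arithmetic; 1.1M is the rounded value from the profiler view, so the result is approximate.

```python
def average_miss_penalty(stall_cycles, miss_count):
    """Average stall cycles attributed to each cache miss."""
    return stall_cycles / miss_count

if __name__ == "__main__":
    l1d_read_miss_cycles = 1_100_000   # ~1.1M cycles (rounded, from the step above)
    l1d_read_misses = 16_320           # from the step above
    penalty = average_miss_penalty(l1d_read_miss_cycles, l1d_read_misses)
    print(f"about {penalty:.1f} stall cycles per L1D read miss")
```

With these inputs the result is roughly 67 cycles per miss.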
WS3 – Using files to view data across analysis
1. Open Function Profiler from Tools -> Hardware Trace Analyzer -> Function Profiling
2. Select close Cache Analysis when requested
3. Set End Address = MultiProc_self. This is to work around the issue where this is not remembered.
4. Select Start
5. Complete step iv of Steps to Run Demo
6. Wait for Demo to complete
7. Press Stop in Trace Viewer toolbar
8. Wait for Trace Viewer and Exclusive Function Profiler processing to complete
9. Create a c:\temp folder on your hard disk if not already present
10. Select Save in Trace Viewer toolbar
11. Browse to c:\temp, Specify File Name mytrace and select Save
WS3 – Using files to view data across analysis
12. Click Start in Trace Viewer toolbar to restart tracing
13. Repeat steps 5 to 8 above to profile the application again
14. Select Open File from Tools -> Hardware Trace Analyzer -> Open File
15. Browse to c:\temp, select File Name mytrace.tdf and select Open
16. In Trace Viewer – mytrace.tdf select Analyze -> Exclusive Function Profiling
17. Now the current profile result can be visually compared with the saved result
WS3 – Exporting data
1. In Exclusive Function Profile – C66xx_0 right-click-context menu, select Data -> Export All … to export all records
2. (Optional) In the Export Data Dialog Add/Remove columns to export
3. (Optional) In the Export Data Dialog use the Move button to rearrange order in which columns are to be exported
4. Browse to c:\temp folder, specify File Name myexporttrace and select Save, then select OK to export all records
5. Data is exported in CSV format which can be consumed by CCS and other tools such as Excel
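Because the export is plain CSV, it can also be post-processed with a short script. The sketch below ranks functions by cycle count using Python's csv module. The column name "Function" and all numbers in the sample are hypothetical; check the header row of your own export (the view's CPU Cycle Total column is referenced earlier in this workshop) and adjust the names accordingly.

```python
import csv
import io

def top_functions(csv_text, name_col, cycles_col, n=3):
    """Return the n rows with the highest cycle count as (name, cycles) pairs."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: int(r[cycles_col]), reverse=True)
    return [(r[name_col], int(r[cycles_col])) for r in rows[:n]]

# Hypothetical sample data; real exports will have different columns and values.
SAMPLE = """Function,CPU Cycle Total
IMG_sobel_3x3_8,900
IMG_thr_le2min_8,1300
convert_rgb_to_y,40
"""

if __name__ == "__main__":
    for name, cycles in top_functions(SAMPLE, "Function", "CPU Cycle Total", n=2):
        print(name, cycles)
```

To use a real export, read the file contents (for example c:\temp\myexporttrace.csv) and pass the text in place of SAMPLE.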
WS3 – Importing data
1. Select Open File from Tools -> Hardware Trace Analyzer -> Open File
2. At Bottom Right corner of Open Trace File dialog select CSV trace data file (*.csv)
3. Browse to c:\temp, select File Name myexporttrace.csv, select Open
4. The data from the CSV file is now visible in the Trace Viewer
WS3 – Using analysis dashboard
1. Open the Analysis Dashboard from Tools -> Hardware Trace Analyzer -> Analysis Dashboard
2. Observe the features of the Dashboard listed below
3. Select Remove All to remove all running analyses
Dashboard features (screenshot callouts):
• List of all running analyses
• Run additional analysis
• Delete selected analysis
• Delete all analyses
• Expand/collapse all nodes
• Enable/disable analysis. Disabling frees up all hardware resources
• Open configuration dialog
• What is the data source
• Click to collapse/expand
• Double-click to open/select view
WS3- What did we learn?
• DSP Trace can be used to profile hotspots in application
• Hotspots can further be analyzed using stall and cache profiling
• Data can be saved to binary file to use for comparison with future results or to share with others
• Data can be exported/imported via CSV file
• Analysis Dashboard provides access to all analysis
WS4 Getting started with non-intrusive system trace (STM) SoC profiling
WS4- Setup memory throughput analysis (i)
1. Continue from WS3
2. Go to Tools -> Hardware Trace Analyzer -> Memory Throughput and Access Analysis
3. Select Transport Type: Pro Trace, Buffer Type: Stop-on-full, Buffer Size: 64 MB, and Number of Pins: 4 pin. Go to the advanced settings
4. By default, DDR3 memory throughput will be captured.
5. For DDR3, under Transaction Master enable only C66x_0 (core 0) and disable all other masters.
6. Click OK.
7. Now click on Start to set up the trace
8. Run all the cores from CCS, if not already running. Now run the demo by following the slides Steps to run the demo (iv) and (v).
WS4- Setup memory throughput analysis (ii)
9. Hit Stop in the Trace Viewer tab
10. DDR CP tracer messages are captured in the Trace Viewer tab
WS4- Analysis view
WS4- View core0 DDR3 bandwidth utilization
11. Select the Memory Throughput – CSSTM_0 tab, select DDR:CPU, and zoom to the portion of the graph where the image is being processed
WS4- View system DDR3 bandwidth utilization
12. Now select DDR:All Bus Masters
WS4- View system DDR3 Latency
13. Select the Minimum Average Latency – CSSTM_0 tab and zoom to the portion of the graph where the image is being processed
WS4- What did we learn?
• On the MCSDK image processing demo, we compared Core0’s DDR3 bandwidth usage with the complete system’s DDR3 bandwidth usage.
• On the MCSDK image processing demo, we captured system’s DDR3 latency.
• We were able to setup and analyze DDR3 memory performance and access analysis for MCSDK image processing demo.
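When comparing core 0's DDR3 usage with the whole system's, two derived numbers are usually of interest: throughput over a time window and one master's share of the total. The sketch below shows that arithmetic only; the byte counts and window length are hypothetical inputs, not values produced by the CP Tracer or read from the CCS views.

```python
def throughput_mb_per_s(bytes_in_window, window_seconds):
    """Throughput in MB/s (1 MB = 10**6 bytes) for one sampling window."""
    return bytes_in_window / window_seconds / 1e6

def share_percent(master_bytes, total_bytes):
    """Percentage of total traffic generated by one bus master."""
    return 100.0 * master_bytes / total_bytes if total_bytes else 0.0

if __name__ == "__main__":
    core0_bytes, system_bytes = 500_000, 2_000_000   # hypothetical counts
    window = 0.001                                    # hypothetical 1 ms window
    print(f"system: {throughput_mb_per_s(system_bytes, window):.0f} MB/s")
    print(f"core 0 share: {share_percent(core0_bytes, system_bytes):.1f}%")
```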
WS5 Customize STM SoC profiling for DDR and MSMC bandwidth and latency analysis
WS5- Setup (i)
1. Continue from WS4
2. Click on Analysis Properties to bring up the setup configuration dialog box
3. Now go to the advanced settings
WS5- Setup (ii)
4. By default, DDR3 memory throughput will be captured.
5. Now add a custom trigger for capturing MSMC memory throughput.
6. In properties, set Transaction monitor to MSMC_0 and Average Access Size/Rate to false
7. Click OK.
8. Now click on Start to set up the trace
9. Run all the cores from CCS, if not already running. Now run the demo by following the slides Steps to run the demo (iv) and (v)
WS5- Analysis view
10. Hit Stop in the Trace Viewer tab
11. DDR and MSMC CP tracer messages are captured in the Trace Viewer tab
WS5- View core0 DDR3 bandwidth utilization
12. Select the Memory Throughput – CSSTM_0 tab, select DDR:CPU, and zoom to the portion of the graph where the image is being processed
WS5- View system DDR3 bandwidth utilization
13. Now select DDR:All Bus Masters
WS5- View cores MSMC_0 bandwidth utilization
14. Now select MSMC_0:CPU
WS5- View system MSMC_0 bandwidth utilization
15. Now select MSMC_0:All Bus Masters
WS5- What did we learn?
• We customized the non-intrusive SoC profiling (memory performance and access analysis) job to add MSMC memory bandwidth measurement to the default DDR memory bandwidth job.
• On the MCSDK image processing demo, we compared Core0’s DDR3 bandwidth usage with the complete system’s DDR3 bandwidth usage.
• On the MCSDK image processing demo, we compared all cores (0-3) MSMC (bank0) bandwidth usage with the complete system’s MSMC (bank0) bandwidth usage.
• Similar to MSMC_0, bandwidth at any other memory end point (core0 L2, core1 L2 . . ) can be measured.
WS6 DDR bandwidth and latency analysis by interfacing embedded APIs (cToolsLib)
cToolsLib Software
[Figure: cToolsLib software layers: Keystone Debug and Trace HW; AETLib, DSPTraceLib, ETBLib, CPTLib, STMLib; Ctools Use-Case Library (Ctools_UCLib); Application SW.]
Use cases shown for Ctools_UCLib:
• CP Tracer profiling: system bandwidth, system latency, master bandwidth, total bandwidth, event profiling
• PC Trace: trace capture on an exception; start and stop PC + timing trace
• Memory watch: capture a list of unintended accesses to a particular memory range; raise an exception on the first unintended access
• Statistical profiling
• CToolsLib package is a collection of libraries that provides access to Keystone debug and trace features via software APIs.
• cToolsLib information and APIs are available on http://processors.wiki.ti.com/index.php/CToolsLib
• MCSDK also comes with cToolsLib package and Use Case library that encapsulates cToolsLib APIs into high level use cases for easy integration (c:\ti\ctoolslib_1_0_0_2)
WS6- Setup (i)
1. cToolsLib instrumentation APIs are embedded for the master core (core0). This workshop uses the image processing demo with cToolsLib instrumentation.
2. Power cycle the C6670 EVM.
3. Follow all the steps in slide Steps to run the demo (ii)
4. Connect the cores
5. Load image_processing_evmc6670l_total_bandwidth_master.out (mcsdk_2_01_02_06\demos\image_processing\ipc\evmc6670l\master\total_bandwidth\Debug) on core 0.
6. Load image_processing_evmc6670l_slave.out (mcsdk_2_01_02_06\demos\image_processing\ipc\evmc6670l\slave\no_instrumentation\Debug) on cores 1, 2, and 3.
WS6- Setup (ii)
7. Go to Tools -> Hardware Trace Analyzer -> Custom System Trace and select Transport Type: Pro Trace, Buffer Type: Stop-on-full, Buffer Size: 64 MB, and Number of Pins: 4 pin
8. Hit Start; a Trace Viewer – CSSTM_0 tab is opened.
9. Now run all the cores. Run the demo, following the slides Steps to run the demo (iv) and (v)
WS6- Analysis view
10. Hit Stop in the Trace Viewer tab
11. DDR CP tracer messages are captured in the Trace Viewer tab
WS6- View core0 DDR3 bandwidth utilization
12. From the Trace Viewer tab, select Analyze -> Memory Throughput
WS6- View system DDR3 bandwidth utilization
13. Now select DDR:All Bus Masters
WS6- View system DDR3 Latency
14. From the Trace Viewer tab, select Analyze -> Minimum Average Latency
WS6- What did we learn?
• How to use cToolsLib embedded APIs to perform non-intrusive system trace (STM) SoC profiling.
• On the MCSDK image processing demo, we compared Core0’s DDR3 bandwidth usage with the complete system’s DDR3 bandwidth usage.
• On the MCSDK image processing demo, we captured system’s DDR3 latency.
• Using cToolsLib embedded APIs, we were able to setup and analyze DDR3 memory performance and access analysis for MCSDK image processing demo.
WS7 Command line trace decode interface for offline decode
WS7- Invoking command line decoder
1. This workshop uses TD (http://processors.wiki.ti.com/index.php/TD), a command line decoder to convert hardware trace into human readable format.
2. Open a command line shell and change directory to <CCS>\ccsv5\ccs_base\emulation\analysis\bin
3. We use previously saved trace file (TDF) from the WS3 to decode offline
4. Type the following command
td -procid 66x -bin C:\temp\mytrace.tdf -app C:/ti/MCSDK_2_01_02_06/demos/image_processing/ipc/evm6670l/image_processing_evmc6670l_master.out -rcvr Pro -format CSV_NO_TPOS_QUOTE -columns "Program Address,Cycles,Trace Status" -timestamp abs -output mytrace.csv
Notes:
• Without the -output option, the output is written to standard output.
• If you do not specify the -columns option, all columns are output. Help is available with the -help option.
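For scripted or batch decoding, the same command line can be assembled programmatically. The sketch below builds the argument list using only the options that appear in the command above; the helper name and its parameters are hypothetical, and it assumes td is on the PATH (or that the script runs from the directory named in step 2).

```python
def build_td_command(tdf_path, app_path, columns, output_path=None, timestamp="abs"):
    """Build a td argument list from the options shown in this workshop."""
    cmd = [
        "td",
        "-procid", "66x",
        "-bin", tdf_path,
        "-app", app_path,
        "-rcvr", "Pro",
        "-format", "CSV_NO_TPOS_QUOTE",
        "-columns", ",".join(columns),
        "-timestamp", timestamp,
    ]
    if output_path is not None:
        cmd += ["-output", output_path]   # omit to let td write to standard output
    return cmd

if __name__ == "__main__":
    cmd = build_td_command(
        r"C:\temp\mytrace.tdf",
        "image_processing_evmc6670l_master.out",   # substitute the full path to the .out file
        ["Program Address", "Cycles", "Trace Status"],
        output_path="mytrace.csv",
    )
    print(cmd)
```

The resulting list can be passed to subprocess.run(cmd, check=True) to run the decoder; passing a list avoids shell quoting problems with the column names, which contain spaces.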
WS7- Visualizing the output
1. The output from the command line decoder is below (opened in MS Excel) . This includes all the information.
2. If needed, the output CSV file could also be imported in CCS for further analysis.
WS7- Further processing the output
1. The output of the tool could be directed to other general purpose post processing tools for custom /command line processing.
td -procid 66x -bin C:\temp\mytrace.tdf -app C:/ti/MCSDK_2_01_02_06/demos/image_processing/ipc/evm6670l/image_processing_evmc6670l_master.out -rcvr Pro -format CSV_NO_TPOS_QUOTE -columns "Program Address,Cycles,Trace Status" -timestamp delta | grep "Pipeline stall"
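On a host without grep (for example a plain Windows command shell), the same filtering can be done with a few lines of Python reading the decoder output from standard input. This is a minimal sketch; the script name used in the usage note is hypothetical.

```python
import sys

def filter_lines(lines, needle="Pipeline stall"):
    """Keep only the lines that contain the given text, like a basic grep."""
    return [line for line in lines if needle in line]

if __name__ == "__main__":
    # Reads decoded trace records from standard input and prints the matches.
    for line in filter_lines(sys.stdin):
        sys.stdout.write(line)
```

Saved as, say, filter_stalls.py, it can take the place of grep at the end of the pipe: td ... -timestamp delta | python filter_stalls.py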
WS7- What did we learn?
• How to invoke command line decoder for offline decode on trace captured elsewhere.
• The output can also be imported in CCS for analysis.
• The output can be stored in a text file or piped to other tools for additional processing.
WS8 Function profiling using Cortex A program trace
WS8- Setup (i)
1. In CCS, setup C6614 target configuration with XDS560v2 Pro Trace USB connection
– Go to File -> New -> Target Configuration File
– Type file name as C6614_XDS560v2 and click Finish
– Now select Connection as Spectrum Digital XDS560V2 STM USB Emulator
– Type C6614 in the Device field; device names will be filtered; select/check TMS320C6614
– Click on Target Configuration from Advanced Setup (RHS)
– Select View->Target Configurations to see a list of all configuration files.
– Select the one you just created (as C6614_XDS560v2 .ccxml) under User Defined.
– Launch the debug session by selecting the Launch Selected Configuration in the context menu.
2. Connect Cortex A8 core
WS8- Setup (ii)
3. Download the Cortex A8 demo example (A8.zip) and unzip the files on your PC.
4. Load modem.out to the Cortex A8 from the downloaded example.
5. In menu select Tools -> Hardware Trace Analyzer -> PC Trace to start PC Trace
6. Click on Start to setup the trace and open Trace Viewer.
WS8- Tracing program execution
1. Now run the application from the debug view (Resume/F8), then halt or suspend the execution
2. Program trace shows up in the Trace Viewer
WS8- Profiling and execution flow graph
1. Function profiling can be run on the program execution trace via Analyze -> Exclusive Function Profiler
2. Function Execution Graph can be launched via Analyze -> Function Execution Graph
WS8- What did we learn?
• Can use Cortex A8 ETM trace to get real-time trace of program execution
• Analysis is available in Trace Viewer to run on collected data
• Profiling analysis provides summary of executed functions and cycles spent
• Function Execution Graph provides a bird’s eye view of program execution
• Can use Function Execution Graph to measure the number of cycles between operations