
THIS WEEK: QFABRIC SYSTEM TRAFFIC FLOWS AND TROUBLESHOOTING

Traditional Data Center architecture follows a layered approach that uses separate switch devices for access, aggregation, and core layers. But a completely scaled QFabric system combines all the member switches and enables them to function as a single unit. So, if your Data Center deploys a QFabric system with one hundred QFX3500 nodes, then those one hundred switches will act like a single switch.

Traffic flows differently in this super-sized Virtual Chassis that spans your entire Data Center. Knowing how traffic moves is critical to understanding and architecting Data Center operations, but it is also necessary to ensure efficient day-to-day operations and troubleshooting.

This Week: QFabric System Traffic Flows and Troubleshooting is a deep dive into how the QFabric system externalizes the data plane for both user data and data plane traffic and why that’s such a massive advantage from an operations point of view.

“QFabric is a unique accomplishment – making 128 switches look and function as one. Ankit brings to this book a background of both supporting QFabric customers and, as a resident engineer, implementing complex customer migrations. This deep dive into the inner workings of the QFabric system is highly recommended for anyone looking to implement or better understand this technology.” John Merline, Network Architect, Northwestern Mutual

LEARN SOMETHING NEW ABOUT QFABRIC THIS WEEK:

• Understand the QFabric system technology in great detail.

• Compare the similarities of the QFabric architecture with MPLS-VPN technology.

• Verify the integrity of various protocols that ensure smooth functioning of the QFabric system.

• Understand the various active/backup Routing Engines within the QFabric system.

• Understand the various data plane and control plane flows for different kinds of traffic within the QFabric system.

• Operate and effectively troubleshoot issues that you might face with a QFabric deployment.

Published by Juniper Networks Books

www.juniper.net/books
ISBN 978-1936779871


Junos® Fabric and Switching Technologies

THIS WEEK: QFABRIC SYSTEM TRAFFIC FLOWS AND TROUBLESHOOTING

By Ankit Chadha

Knowing how traffic flows through a QFabric system is knowing how your Data Center can scale.


This Week: QFabric System Traffic Flows and Troubleshooting

By Ankit Chadha

Chapter 1: Physical Connectivity and Discovery
Chapter 2: Accessing Individual Components
Chapter 3: Control Plane and Data Plane Flows
Chapter 4: Data Plane Forwarding

Knowing how traffic flows through your QFabric system is knowing how your Data Center can scale.


© 2014 by Juniper Networks, Inc. All rights reserved. Juniper Networks, Junos, Steel-Belted Radius, NetScreen, and ScreenOS are registered trademarks of Juniper Networks, Inc. in the United States and other countries. The Juniper Networks Logo, the Junos logo, and JunosE are trademarks of Juniper Networks, Inc. All other trademarks, service marks, registered trademarks, or registered service marks are the property of their respective owners. Juniper Networks assumes no responsibility for any inaccuracies in this document. Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this publication without notice.

Published by Juniper Networks Books
Author: Ankit Chadha
Technical Reviewers: John Merline, Steve Steiner, Girish SV
Editor in Chief: Patrick Ames
Copyeditor and Proofer: Nancy Koerbel
J-Net Community Manager: Julie Wider

ISBN: 978-1-936779-87-1 (print)
ISBN: 978-1-936779-86-4 (ebook)
Printed in the USA by Vervante Corporation.

Version History: v1, April 2014

About the Author: Ankit Chadha is a Resident Engineer in the Advanced Services group of Juniper Networks. He has worked on QFabric system solutions in various capacities including solutions testing, engineering escalation, customer deployment, and design roles. He holds several industry recognized certifications such as JNCIP-ENT and CCIE-RnS.

Author’s Acknowledgments: I would like to thank Patrick Ames, our Editor in Chief, for his continuous encouragement and support from the conception of this idea until the delivery. There is no way that this book would have been successfully completed without Patrick’s support. John Merline and Steve Steiner provided invaluable technical review; Girish SV spent a large amount of time carefully reviewing this book and made sure that it’s ready for publishing. Nancy Koerbel made sure that it shipped without any embarrassing mistakes. Thanks to Steve Steiner, Mahesh Chandak, Ruchir Jain, Vaibhav Garg, and John Merline for being great mentors and friends. Last but not least, I’d like to thank my family and my wife, Tanu, for providing all the support and love that they always do.

This book is available in a variety of formats at: http://www.juniper.net/dayone.


Welcome to This Week

This Week books are an outgrowth of the popular Day One book series published by Juniper Networks Books. Day One books focus on providing just the right amount of information that you can execute, or absorb, in a day. This Week books, on the other hand, explore networking technologies and practices that in a classroom setting might take several days to absorb or complete. Both libraries are available to readers in multiple formats:

• Download a free PDF edition at http://www.juniper.net/dayone.

• Get the ebook edition for iPhones and iPads at the iTunes Store>Books. Search for Juniper Networks Books.

• Get the ebook edition for any device that runs the Kindle app (Android, Kindle, iPad, PC, or Mac) by opening your device’s Kindle app and going to the Kindle Store. Search for Juniper Networks Books.

• Purchase the paper edition at either Vervante Corporation (www.vervante.com) or Amazon (www.amazon.com) for prices between $12-$28 U.S., depending on page length.

• Note that Nook, iPad, and various Android apps can also view PDF files.

• If your device or ebook app uses .epub files, but isn’t an Apple product, open iTunes and download the .epub file from the iTunes Store. You can now drag and drop the file out of iTunes onto your desktop and sync with your .epub device.

What You Need to Know Before Reading

Before reading this book, you should be familiar with the basic administrative functions of the Junos operating system, including the ability to work with operational commands and to read, understand, and change Junos configurations. There are several books in the Day One library to help you learn Junos administration, at www.juniper.net/dayone.

This book makes a few assumptions about you, the reader:

• You have a working understanding of Junos and the Junos CLI, including configuration changes using edit mode. See the Day One books at www.juniper.net/dayone for a variety of tutorials on Junos at all skill levels.

• You can make configuration changes using the CLI edit mode. See the Day One books at www.juniper.net/dayone for a variety of tutorials on Junos at all skill levels.

• You have an understanding of networking fundamentals like ARP, MAC addresses, etc.

• You have a thorough familiarity with BGP fundamentals.

• You have a thorough familiarity with MP-BGP and MPLS-VPN fundamentals and their terminology. See This Week: Deploying MBGP Multicast VPNs, Second Edition at www.juniper.net/dayone for a quick review.

• Finally, this book uses outputs from actual QFabric systems and deployments; readers are strongly encouraged to have a stable lab setup in which to execute those commands.


What You Will Learn From This Book

• You’ll understand the workings of the QFabric technology (in great detail).

• You’ll be able to compare the similarities of the QFabric architecture with MPLS VPN technology.

• You’ll be able to verify the integrity of various protocols that ensure smooth functioning of the QFabric system.

• You’ll understand the various active/backup Routing Engines within the QFabric system.

• You’ll understand the various data plane and control plane flows for different kinds of traffic within the QFabric system.

• You’ll be able to operate and effectively troubleshoot issues that you might face with a QFabric system deployment.

Information Experience

This book is singularly focused on one aspect of networking technology. There are other sources at Juniper Networks, from white papers to webinars to online forums such as J-Net (forums.juniper.net). Look for the following sidebars to directly access other superb informational resources:

MORE? It’s highly recommended you go through the technical documentation and the minimum requirements to get a sense of QFabric hardware and deployment before you jump in. The technical documentation is located at www.juniper.net/documentation. Use the Pathfinder tool on the documentation site to explore and find the right information for your needs.

About This Book

This book focuses on the inner workings and internal traffic flows of the Juniper Networks QFabric solution and does not address deployment or configuration practices.

MORE? The complete deployment guide for the QFabric can be found here: https://www.juniper.net/techpubs/en_US/junos11.3/information-products/pathway-pages/qfx-series/qfabric-deployment.html.

QFabric vs. Legacy Data Center Architecture

Traditional Data Center architecture follows a layered approach to building a Data Center using separate switch devices for access, aggregation, and core layers. Obviously these devices have different capacities with respect to their MAC table sizes, depending on their role or their placement in the different layers.

Since Data Centers host mission critical applications, redundancy is of prime importance. To provide the necessary physical redundancy within a Data Center, Spanning Tree Protocol (STP) is used. STP is a popular technology and is widely deployed around the world. A Data Center like the one depicted in Figure A.1 always runs some flavor of STP to manage the physical redundancy. But there are drawbacks to using Spanning Tree Protocol for redundancy:


Figure A.1 Traditional Layered Data Center Topology

• STP works on the basis of blocking certain ports, meaning that some ports can potentially be overloaded, while the blocked ports do not forward any traffic at all. This is highly undesirable, especially because the switch ports deployed in a Data Center are rather costly.

• This situation of some ports not forwarding any traffic can be overcome somewhat by using different flavors of the protocol, like PVST or MSTP, but STP inherently works on the principle of blocking ports. Hence, even with PVST or MSTP, complete load balancing of traffic over all the ports cannot be achieved. Using PVST and MSTP versions, load balancing can be done across VLANs – one port can block for one VLAN or a group of VLANs and another port can block for the rest of the VLANs. However, there is no way to provide load balancing for different flows within the same VLAN.

• Spanning Tree relies on communication between different switches. If there is some problem with STP communication, then the topology change recalculations that follow can lead to small outages across the whole Layer 2 domain. Even small outages like these can cause significant revenue loss for applications that are hosted on your Data Center.

By comparison, a completely scaled QFabric system can have up to 128 member switches. This new technology works by combining all the member switches and making them function as a single unit to other external devices. So if your Data Center deploys a QFabric with one hundred QFX3500 nodes, then all those one hundred switches will act as a single switch. In short, that single switch (QFabric) will have (100x48) 4800 ports!


Since all the different QFX3500 nodes act as a single switch, there is no need to run any kind of loop prevention protocol like Spanning Tree. At the same time, there is no compromise on redundancy because all the Nodes have redundant connections to the backplane (details on the connections between different components of a QFabric system are discussed throughout this book). This is how the QFabric solution takes care of the STP problem within the Data Center.

Consider the case of a traditional (layered) Data Center design. Note that if two hosts connected to different access switches need to communicate with each other, they need to cross multiple switches in order to do that. In other words, communication in the same or a different VLAN might need to cross multiple switch hops to be successful.

Since all the Nodes within a QFabric system work together and act as a large single switch, all the external devices connected to the QFabric Nodes (servers, filers, load balancers, etc.) are just one hop away from each other. This leads to a lower number of lookups, and hence, considerably reduces latency.

Different Components of a QFabric System

A QFabric system has multiple physical and logical components – let’s identify them here so you have a common place you can return to when you need to review them.

Physical Components

A QFabric system has the following physical components as shown in Figure A.2:

• Nodes: These are the top-of-rack (TOR) switches to which external devices are connected. All the server-facing ports of a QFabric system reside on the Nodes. There can be up to 128 Nodes in a QFabric-G system and up to 16 Nodes in a QFabric-M implementation. Up-to-date details on the differences between various QFabric systems can be found here: http://www.juniper.net/us/en/products-services/switching/qfabric-system/#overview.

• Interconnects: The Interconnects act as the backplane for all the data plane traffic. All the Nodes should be connected to all the Interconnects as a best practice. There can be up to four Interconnects (QFX3008-I) in both QFabric-G and QFabric-M implementations.

• Director Group: There are two Director devices (DG0 and DG1) in both QFabric-G and QFabric-M implementations. These Director devices are the brains of the whole QFabric system and host the necessary virtual components (VMs) that are critical to the health of the system. The two Director devices operate in a master/slave relationship. Note that all the protocol/route/inventory states are always synced between the two.

• Control Plane Ethernet Switches: These are two independent EX VCs or EX switches (in the case of QFabric-G and QFabric-M, respectively) to which all the other physical components are connected. These switches provide the necessary Ethernet network over which the QFabric components can run the internal protocols that maintain the integrity of the whole system. The LAN segment created by these devices is called the Control Plane Ethernet segment or the CPE segment.


Figure A.2 Components of a QFabric System

Virtual Components

The Director devices host the following Virtual Machines:

• Network Node Group VM: The NWNG-VMs are the routing brains of a QFabric system, where all the routing protocols like OSPF, BGP, or PIM are run. There are two NWNG-VMs in a QFabric system (one hosted on each DG) and they operate in an active/backup fashion, with the active VM always being hosted on the master Director device.

• Fabric Manager: The Fabric Manager VM is responsible for maintaining the hardware inventory of the whole system. This includes discovering new Nodes and Interconnects as they’re added and keeping track of the ones that are removed. The Fabric Manager is also in charge of keeping a complete topological view of how the Nodes are connected to the Interconnects. In addition to this, the FM also needs to provide internal IP addresses to every other component to allow the internal protocols to operate properly. There is one Fabric Manager VM hosted on each Director device and these VMs operate in an active/backup configuration.

• Fabric Control: The Fabric Control VM is responsible for distributing various routes (Layer 2 or Layer 3) to the different Nodes of a QFabric system. This VM forms internal BGP adjacencies with all the Nodes and Interconnects and sends the appropriate routes over these BGP peerings. There is one Fabric Control VM hosted on each Director device and these operate in an active/active fashion.


Node Groups Within a QFabric System

Node groups are a new concept introduced by the QFabric technology: a Node group is a logical collection of one or more physical Nodes that are part of a QFabric system. Whenever multiple Nodes are configured to be part of a Node group, they act as one. Individual Nodes can be configured to be a part of these kinds of Node groups:

• Server Node Group (SNG): This is the default group and consists of one Node. Whenever a Node becomes part of a QFabric system, it comes up as an SNG. These mostly connect to servers that do not need any cross-Node redundancy. The most common examples are servers that have only one NIC.

• Redundant Server Node Group (RSNG): An RSNG consists of two physical Nodes. The Routing Engines on the Nodes operate in an active/backup fashion (think of a Virtual Chassis with two member switches). You can configure multiple pairs of RSNGs within a QFabric system. These mostly connect to dual-NIC servers.

• Network Node Group (NWNG): Each QFabric system has one Network Node Group and up to eight physical Nodes can be configured to be part of the NWNG. The Routing Engines (RE) on the Nodes are disabled and the RE functionality is handled by the NWNG-VMs that are located on the Director devices.

MORE? Every Node device can be a part of only one Node group at a time. The details on how to configure different kinds of Node groups can be found here: http://www.juniper.net/techpubs/en_US/junos12.2/topics/task/configuration/qfabric-node-groups-configuring.html.

NOTE Chapter 3 covers these abstractions, including a discussion of packet flows.

Differences Between a QFabric System and a Virtual Chassis

Juniper EX Series switches support Virtual Chassis (VC) technology, which enables multiple physical switches to be combined. These multiple switches then act as a single switch.

MORE? For more details on the Virtual Chassis technology, refer to the following technical documentation: https://www.juniper.net/techpubs/en_US/junos13.3/topics/concept/virtual-chassis-ex4200-components.html.

One of the advantages of a QFabric system is its scale. A Virtual Chassis can host tens of switches, but a fully scaled QFabric system can have a total of 128 Nodes combined.

The QFabric system, however, is much more than a supersized Virtual Chassis. QFabric technology completely externalizes the data plane because of the Interconnects. Chapter 4 discusses the details of how user data or data plane traffic flows through the external data plane.

Another big advantage of the QFabric system is that the Nodes can be present at various locations within the Data Center. The Nodes are normally deployed as top-of-rack (TOR) switches and connect to the Interconnects to reach the backplane. That’s why QFabric is such a massive advantage from an operations point of view: it behaves as one large switch that spans the entire Data Center, while the cables from the servers still plug in only to the top-of-rack switches.

Various components of the QFabric system (including the Interconnects) are discussed throughout this book and you can return to these pages to review these basic definitions at any time.


Chapter 1

Physical Connectivity and Discovery

Interconnections of Various Components
Why Do You Need Discovery?
System and Component Discovery
Fabric Topology Discovery (VCCPDf)
Relation Between VCCPD and VCCPDf
Test Your Knowledge


This chapter discusses what a plain-vanilla QFabric system is supposed to look like. It does not discuss issues in the data plane or packet forwarding; its only focus is the internal workings of the QFabric system and checking the protocols that are instrumental in making it function as a single unit.

The important first step in setting up a QFabric system is to cable it correctly.

MORE? Juniper has great documentation about cabling and setting up a QFabric system, so it won’t be repeated here. If you need to, review the best practices of QFabric cabling: https://www.juniper.net/techpubs/en_US/junos11.3/information-products/pathway-pages/qfx-series/qfabric-deployment.html

Make sure that the physical connections are made exactly as described in the deployment guide. That’s how the test units used for this book were set up. Any variations in your lab QFabric system might cause discrepancies with the corresponding output shown in this book.

Interconnections of Various Components

As discussed, the QFabric system consists of multiple physical components and these components need to be connected to each other as well. Consider these inter-component links:

• Nodes to EX Series VC: These are 1GbE links

• Interconnects to EX Series VC: These are 1GbE links

• DG0/DG1 to EX Series VC: These are 1GbE links

• DG0 to DG1: These are 1GbE links

• Nodes to Interconnects: These are 40GbE links

All these physical links are Gigabit Ethernet links except for the 40GbE links between the Nodes and the Interconnects – these 40GbE links show up as FTE interfaces on the CLI. The usual Junos commands like show interfaces terse and show interfaces extensive apply and should be used for troubleshooting any issues related to finding errors on the physical interfaces.

The only devices where these Junos commands cannot be run are the Director devices because they run on Linux. However, the usual Linux commands do work on Director devices (like ifconfig, top, free, etc.).

Let’s start with one of the most common troubleshooting utilities that a network engineer needs to know about: checking the status of the interfaces and their properties on the Director devices.

To check the status of the interfaces of the Director devices from the Linux prompt, the regular ifconfig command can be used. However, the output uses the following names for specific interface types:

• bond0: This is the name of the aggregated interface that gets connected to the other Director device (DG). The two Director devices are called DG0 and DG1. Note that the IP address for the bond0 interface on DG0 is always 1.1.1.1 and on DG1 it is always set to 1.1.1.2. This link is used for syncing and maintaining the states of the two Director devices. These states include VMs, configurations, file transfers (cores), etc.


• bond1: This aggregated interface is used mainly for internal Control plane communication between the Director devices and other QFabric components like the Nodes and the Interconnects.

• eth0: This is the management interface of the DG. This interface gets connected to the network and you can SSH to the IP address of this interface from an externally reachable machine. Each Director device has an interface called eth0, which should be connected to the management network. At the time of installation, the QFabric system prompts the user to enter the IP address for the eth0 interface of each Director device. In addition to this, the user is required to add a third IP address called the VIP (Virtual IP Address). This VIP is used to manage the operations of QFabric, such as SSH, telnet, etc.

Also, the CLI command show fabric administration inventory director-group status shows the status of all the interfaces. Here is sample output of this CLI command:

root@TEST-QFABRIC> show fabric administration inventory director-group status
Director Group Status Tue Feb 11 08:32:50 CST 2014

Member  Status  Role    Mgmt Address   CPU  Free Memory  VMs  Up Time
------  ------  ------  -------------  ---  -----------  ---  -------------
dg0     online  master  172.16.16.5    1%   3429452k     4    97 days, 02:14 hrs
dg1     online  backup  172.16.16.6    0%   8253736k     3    69 days, 23:42 hrs

Member  Device Id/Alias  Status   Role
------  ---------------  -------  ---------
dg0     TSTDG0           online   master

  Master Services
  ---------------
  Database Server                 online
  Load Balancer Director          online
  QFabric Partition Address       offline

  Director Group Managed Services
  -------------------------------
  Shared File System              online
  Network File System             online
  Virtual Machine Server          online
  Load Balancer/DHCP              online

  Hard Drive Status
  -----------------
  Volume ID:0FFF04E1F7778DA3      optimal
  Physical ID:0                   online
  Physical ID:1                   online
  Resync Progress Remaining:0     0%
  Resync Progress Remaining:1     0%

  Size  Used  Avail  Used%  Mounted on
  ----  ----  -----  -----  ----------
  423G  36G   366G   9%     /
  99M   16M   79M    17%    /boot
  93G   13G   81G    14%    /pbdata

  Director Group Processes
  ------------------------
  Director Group Manager          online
  Partition Manager               online
  Software Mirroring              online
  Shared File System master       online
  Secure Shell Process            online
  Network File System             online
  FTP Server                      online
  Syslog                          online
  Distributed Management          online
  SNMP Trap Forwarder             online
  SNMP Process                    online
  Platform Management             online

  Interface Link Status
  ---------------------
  Management Interface            up
  Control Plane Bridge            up
  Control Plane LAG               up
    CP Link [0/2]                 down
    CP Link [0/1]                 up
    CP Link [0/0]                 up
    CP Link [1/2]                 down
    CP Link [1/1]                 up
    CP Link [1/0]                 up
  Crossover LAG                   up
    CP Link [0/3]                 up
    CP Link [1/3]                 up

Member  Device Id/Alias  Status   Role
------  ---------------  -------  ---------
dg1     TSTDG1           online   backup

  Director Group Managed Services
  -------------------------------
  Shared File System              online
  Network File System             online
  Virtual Machine Server          online
  Load Balancer/DHCP              online

  Hard Drive Status
  -----------------
  Volume ID:0A2073D2ED90FED4      optimal
  Physical ID:0                   online
  Physical ID:1                   online
  Resync Progress Remaining:0     0%
  Resync Progress Remaining:1     0%

  Size  Used  Avail  Used%  Mounted on
  ----  ----  -----  -----  ----------
  423G  39G   362G   10%    /
  99M   16M   79M    17%    /boot
  93G   13G   81G    14%    /pbdata

  Director Group Processes
  ------------------------
  Director Group Manager          online
  Partition Manager               online
  Software Mirroring              online
  Shared File System master       online
  Secure Shell Process            online
  Network File System             online
  FTP Server                      online
  Syslog                          online
  Distributed Management          online
  SNMP Trap Forwarder             online
  SNMP Process                    online
  Platform Management             online

  Interface Link Status
  ---------------------
  Management Interface            up
  Control Plane Bridge            up
  Control Plane LAG               up
    CP Link [0/2]                 down
    CP Link [0/1]                 up
    CP Link [0/0]                 up
    CP Link [1/2]                 down
    CP Link [1/1]                 up
    CP Link [1/0]                 up
  Crossover LAG                   up
    CP Link [0/3]                 up
    CP Link [1/3]                 up
root@TEST-QFABRIC>
--snip--

Note that this output is taken from a QFabric-M system, and hence, port 0/2 is down on both the Director devices.

Details on how to connect these ports on the DGs are discussed in the QFabric Installation Guide cited at the beginning of this chapter, but once the physical installation of a QFabric system is complete, you should verify the status of all the ports. You’ll find that once a QFabric system is installed correctly, it is ready to forward traffic, and the plug-and-play features of the QFabric technology make it easy to install and maintain.

However, a single QFabric system has multiple physical components, so let’s assume you’ve cabled your test bed correctly in your lab and review how a QFabric system discovers its multiple components and makes sure that those different components act as a single unit.

Why Do You Need Discovery?

The front matter of this book succinctly discusses the multiple physical components that comprise a QFabric system. The control plane consists of the Director groups and the EX Series VC. The data plane of a QFabric system consists of the Nodes and the Interconnects. A somewhat loose (and incorrect) analogy that might be drawn is that the Director groups are similar to the Routing Engines of a chassis-based switch, the Nodes are similar to the line cards, and the Interconnects are similar to the backplane of a chassis-based switch. But QFabric is different from a chassis-based switch as far as system discovery is concerned.

Consider a chassis-based switch. There are only a certain number of slots in such a device and the line cards can only be inserted into the slots available. After a line card is inserted in one of the available slots, it is the responsibility of the Routing Engine to discover this card. Note that since there are a finite number of slots, it is much easier to detect the presence or absence of a line card in a chassis with the help of hardware-based assistance. Think of it as a hardware knob that gets activated whenever a line card is inserted into a slot. Hence, discovering the presence or absence of a line card is easy in a chassis-based device.

However, QFabric is a distributed architecture that was designed to suit the needs of a modern Data Center. A regular Data Center has many server cabinets and the Nodes of a QFabric system can act as the TOR switches. Note that even though the Nodes can be physically located in different places within the Data Center, they still act as a single unit.

One of the implications of this design is that the QFabric system can no longer use a hardware-assist mechanism to detect its different physical components. For this reason, QFabric uses an internal protocol called Virtual Chassis Control Protocol Daemon (VCCPD) to make sure that all the system components can be detected.


System and Component Discovery

Virtual Chassis Control Protocol Daemon runs on the Control plane Ethernet network and is active by default on all the components of the QFabric system. This means that VCCPD runs on all the Nodes, Interconnects, Fabric Control VMs, Fabric Manager VMs, and the Network Node Group VMs. Note that this protocol runs on the backup VMs as well.

There is a VM that runs on the DG whose function is to make these VCCPD adjacencies with all the devices. This VM is called Fabric Manager, or FM.

The Control Plane Ethernet network is composed of the EX Series VC and all of the physical components that have Ethernet ports connected to these EX VCs. Since the devices have no IP addresses when they first come up, VCCPD is based on the IS-IS protocol, so no IP addresses are needed for system discovery. All the components send out and receive VCCPD Hello messages on the Control Plane Ethernet (CPE) network. With the help of these messages, the Fabric Manager VM is able to detect all the components that are connected to the EX Series VCs.

Consider a system in which you have only the DGs connected to the EX Series VC. The DGs host the Fabric Manager VMs, which send VCCPD Hellos on the CPE network. When a new Node is connected to the CPE, then FM and the new Node form a VCCPD adjacency, and this is how the DGs detect the event of a new Node’s addition to the QFabric system. This same process holds true for the Interconnect devices, too.

After the adjacency is created, the Nodes, Interconnects, and the FM send out periodic VCCPD Hellos on the CPE network. These Hellos act as heartbeat messages and bidirectional Hellos confirm the presence or absence of the components.

If the FM doesn’t receive a VCCPD Hello within the hold time, then that device is considered dead and all the routes that were originated from that Node are flushed out from other Nodes.

As with any other protocol, VCCPD adjacencies are formed by the Routing Engine of each component, so VCCPD adjacency status is available on:

• The NWNG-VM for the Node devices that are a part of the Network Node Group

• The master RSNG Node device for a Redundant Server Node Group

• The RE of a standalone Server Node Group device

The show virtual-chassis protocol adjacency provisioning CLI command shows the status of the VCCPD adjacencies:

qfabric-admin@NW-NG-0> show virtual-chassis protocol adjacency provisioning
Interface        System            State    Hold (secs)
vcp1.32768       P7814-C           Up       28
vcp1.32768       P7786-C           Up       28
vcp1.32768       R4982-C           Up       28
vcp1.32768       TSTS2510b         Up       29
vcp1.32768       TSTS2609b         Up       28
vcp1.32768       TSTS2608a         Up       27
vcp1.32768       TSTS2610b         Up       28
vcp1.32768       TSTS2611b         Up       28
vcp1.32768       TSTS2509b         Up       28
vcp1.32768       TSTS2511a         Up       29
vcp1.32768       TSTS2511b         Up       28
vcp1.32768       TSTS2510a         Up       28
vcp1.32768       TSTS2608b         Up       29
vcp1.32768       TSTS2610a         Up       28
vcp1.32768       TSTS2509a         Up       28
vcp1.32768       TSTS2611a         Up       28
vcp1.32768       TSTS1302b         Up       29
vcp1.32768       TSTS2508a         Up       29
vcp1.32768       TSTS2508b         Up       29
vcp1.32768       TSTNNGS1205a      Up       27
vcp1.32768       TSTS1302a         Up       28
vcp1.32768       __NW-INE-0_RE0    Up       28
vcp1.32768       TSTNNGS1204a      Up       29
vcp1.32768       G0548/RE0         Up       27
vcp1.32768       G0548/RE1         Up       28
vcp1.32768       G0530/RE1         Up       29
vcp1.32768       G0530/RE0         Up       28
vcp1.32768       __RR-INE-1_RE0    Up       29
vcp1.32768       __RR-INE-0_RE0    Up       29
vcp1.32768       __DCF-ROOT.RE0    Up       29
vcp1.32768       __DCF-ROOT.RE1    Up       28
{master}

The same output can also be viewed from the Fabric Manager VM:

root@Test-QFabric> request component login FM-0
Warning: Permanently added 'dcfnode---dcf-root,169.254.192.17' (RSA) to the list of known hosts.
--- JUNOS 12.2X50-D41.1 built 2013-03-22 21:44:05 UTC
qfabric-admin@FM-0>
qfabric-admin@FM-0> show virtual-chassis protocol adjacency provisioning
Interface        System            State    Hold (secs)
vcp1.32768       P7814-C           Up       27
vcp1.32768       P7786-C           Up       28
vcp1.32768       R4982-C           Up       29
vcp1.32768       TSTS2510b         Up       29
vcp1.32768       TSTS2609b         Up       28
vcp1.32768       TSTS2608a         Up       29
vcp1.32768       TSTS2610b         Up       28
vcp1.32768       TSTS2611b         Up       28
vcp1.32768       TSTS2509b         Up       27
vcp1.32768       TSTS2511a         Up       29
vcp1.32768       TSTS2511b         Up       29
vcp1.32768       TSTS2510a         Up       27
vcp1.32768       TSTS2608b         Up       28
vcp1.32768       TSTS2610a         Up       28
vcp1.32768       TSTS2509a         Up       28
vcp1.32768       TSTS2611a         Up       28
vcp1.32768       TSTS1302b         Up       28
vcp1.32768       TSTS2508a         Up       27
vcp1.32768       TSTS2508b         Up       29
vcp1.32768       TSTNNGS1205a      Up       28
vcp1.32768       TSTS1302a         Up       29
vcp1.32768       __NW-INE-0_RE0    Up       28
vcp1.32768       TSTNNGS1204a      Up       28
vcp1.32768       G0548/RE0         Up       28
vcp1.32768       G0548/RE1         Up       29
vcp1.32768       G0530/RE1         Up       28
vcp1.32768       G0530/RE0         Up       27
vcp1.32768       __RR-INE-1_RE0    Up       29
vcp1.32768       __NW-INE-0_RE1    Up       28
vcp1.32768       __DCF-ROOT.RE0    Up       29
vcp1.32768       __RR-INE-0_RE0    Up       28
--snip--


VCCPD Hellos are sent every three seconds and the adjacency is lost if the peers don’t see each other’s Hellos for 30 seconds.
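To make this timer behavior concrete, the following short Python sketch models the kind of bookkeeping described above: an adjacency comes up when the first Hello from a component is seen, and it is declared down when no Hello arrives within the hold time. This is purely an illustration of the 3-second/30-second logic; the class and function names are invented for this example and are not part of any QFabric software.

import time

HELLO_INTERVAL = 3.0   # VCCPD Hellos are sent every three seconds
HOLD_TIME = 30.0       # adjacency is dropped after 30 seconds without a Hello

class AdjacencyTable:
    """Toy model of Hello-based neighbor tracking (illustrative only)."""

    def __init__(self):
        self.last_seen = {}            # component name -> time of last Hello

    def receive_hello(self, component, now=None):
        now = time.monotonic() if now is None else now
        is_new = component not in self.last_seen
        self.last_seen[component] = now
        if is_new:
            print(f"adjacency up: {component}")

    def expire(self, now=None):
        """Declare dead any component whose Hellos stopped for HOLD_TIME."""
        now = time.monotonic() if now is None else now
        for component, seen in list(self.last_seen.items()):
            if now - seen > HOLD_TIME:
                del self.last_seen[component]
                print(f"adjacency down: {component} (hold time expired); flush its routes")

# Example: Node-1 sends Hellos, then goes silent.
table = AdjacencyTable()
table.receive_hello("Node-1", now=0.0)
table.receive_hello("Node-1", now=3.0)   # periodic heartbeat
table.expire(now=10.0)                   # still within the hold time: nothing happens
table.expire(now=40.0)                   # more than 30 s since the last Hello: adjacency down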

After the Nodes and Interconnects form VCCPD adjacencies with the Fabric Man-ager VM, the QFabric system has a view of all the connected components.

Note that the VCCPD adjacency only provides details about how many Nodes and Interconnects are present in a QFabric system. VCCPD does not provide any information about the data plane of the QFabric system; that is, it doesn’t provide information about the status of connections between the Nodes and the Interconnects.

Fabric Topology Discovery (VCCPDf)

The Nodes can either be QFX3500s or QFX3600s (QFX5100s are supported as QFabric Nodes only from 13.2X52-D10 onwards), and both of these have four FTE links by default. Note that the term FTE link here means a link that can be connected to the Interconnects. The number of FTE links on a QFX3600 can be modified by using the CLI, but this modification cannot be performed on the QFX3500. These FTE links can be connected to up to four different Interconnects, and the QFabric system uses a protocol called VCCPDf (VCCPD over fabric links), which helps the Director devices form a complete topological view of the QFabric system.

One of the biggest advantages of the QFabric technology is its flexibility and its ability to scale. To further understand this flexibility and scalability, consider a new Data Center deployment in which the initial bandwidth requirements are so low that none of the Nodes are expected to have more than 80 Gbps of incoming traffic at any given point in time. This means that this Data Center can be deployed with all the Nodes having just two out of the four FTE links connected to the Interconnects. To have the necessary redundancy, these two FTE links would be connected to two different Interconnects.

In short, such a Data Center can be deployed with only two Interconnects. However, as the traffic needs of the Data Center grow, more Interconnects can be deployed and the Nodes can then be connected to the newly added Interconnects to allow for greater data plane bandwidth. This kind of flexibility allows for future-proofing of an investment made in the QFabric technology.
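As a rough, back-of-the-envelope illustration of why two FTE links may be enough at first, the following Python sketch works out the uplink bandwidth and the resulting oversubscription ratio for a Node with 48 10GbE server-facing ports (the QFX3500 port count used earlier in this book). The helper function and the oversubscription framing are illustrative assumptions, not a sizing recommendation.

ACCESS_PORT_COUNT = 48      # 10GbE server-facing ports on a QFX3500 Node
ACCESS_PORT_GBPS = 10
FTE_LINK_GBPS = 40          # each FTE uplink toward an Interconnect is 40GbE

def node_uplink_profile(fte_links_connected):
    access = ACCESS_PORT_COUNT * ACCESS_PORT_GBPS
    uplink = fte_links_connected * FTE_LINK_GBPS
    return uplink, access / uplink   # total uplink bandwidth, oversubscription ratio

for links in (2, 4):
    uplink, ratio = node_uplink_profile(links)
    print(f"{links} FTE links: {uplink} Gbps of fabric uplink, {ratio:.1f}:1 oversubscription")

# 2 FTE links: 80 Gbps of fabric uplink, 6.0:1 oversubscription
# 4 FTE links: 160 Gbps of fabric uplink, 3.0:1 oversubscription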

Note that a QFabric system has the built-in intelligence to figure out how many FTE links are connected on each Node, and this information is necessary to know how to load-balance various kinds of traffic between different Nodes.

The QFabric technology uses VCCPDf to figure out the details of the data plane. Whenever a new FTE link is added or removed, it triggers the creation of a new VCCPDf adjacency or the deletion of an existing VCCPDf adjacency, respectively. This information is then fed back to the Director devices over the CPE links so that the QFabric system can always maintain a complete topological view of how the Nodes are connected to the Interconnects. Basically, VCCPDf is a protocol that runs on the FTE links between the Nodes and the Interconnects.

VCCPDf runs on all the Nodes and the Interconnects but only on the 40GbE (or FTE) ports. VCCPDf utilizes the neighbor discovery portion of IS-IS. As a result, each Node device would be able to know how many Interconnects it is connected to, the device ID of those Interconnects, and the connected port’s ID on the Interconnects. Similarly, each Interconnect would be able to know how many Node devices it is connected to, the device ID of those Node devices, and the connected port’s ID on the Node devices. This information is fed back to the Director devices. With the help of this information, the Director devices are able to formulate the complete topological picture of the QFabric system.

This topological information is necessary in order to configure the forwarding tables of the Node devices efficiently. The sequence of steps mentioned later in this chapter will explain why the topological database is needed. (This topological database contains information about how the Nodes are connected to the Interconnects).

Relation Between VCCPD and VCCPDf

All Juniper devices that run the Junos OS run a process called chassisd (chassis daemon). The chassisd process is responsible for monitoring and managing all the hardware-based components present on the device.

QFabric software also uses chassisd. Since there is a system discovery phase involved, inventory management is a little different in this distributed architecture.

Here are the steps that take place internally with respect to system discovery, VCCPD, and VCCPDf:

Nodes, Interconnects, and the VMs exchange VCCPD Hellos on the control plane Ethernet network.

The Fabric Manager VM processes the VCCPD Hellos from the Nodes and the Interconnects. The Fabric Manager VM then assigns a unique PFE-ID to each Node and Interconnect. (The algorithm behind the generation of PFE-ID is Juniper confidential and is beyond the scope of this book.)

This PFE-ID is also used to derive the internal IP address for the components.

After a Node or an Interconnect is detected by VCCPD, the FTE links are activated and VCCPDf starts running on the 40GbE links.

Whenever a new 40GbE link is brought up on a Node or an Interconnect, this information is sent back to the Fabric Manager so that it can update its view of the topology. Note that any communication with the Fabric Manager is done using the CPE network.

Whenever such a change occurs (an FTE link is added or removed), the Fabric Manager recomputes the way data should be load balanced on the data plane. Note that the load balancing does not take place per packet or per prefix. The QFabric system applies an algorithm to find out the different FTE links through which other Nodes can be reached. Consider that a Node has only one FTE link connected to an Interconnect. At this point in time, the Node has only one way to reach the other Nodes. Now if another FTE link is connected, then the programming would be altered to make sure the next hop for some Nodes is FTE-1 and is FTE-2 for others.
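The following Python sketch illustrates the kind of reprogramming described above: given the FTE links a Node currently has, remote Nodes are spread across those links, and the mapping is simply recomputed when a link is added or removed. The round-robin assignment and the function name are assumptions made for this illustration; the actual algorithm used by the Fabric Manager is internal to Juniper.

def assign_next_hops(local_fte_links, remote_nodes):
    """Spread remote Nodes across the available FTE uplinks (illustrative)."""
    if not local_fte_links:
        return {}                       # no data plane: nothing is reachable
    links = sorted(local_fte_links)
    return {node: links[i % len(links)]
            for i, node in enumerate(sorted(remote_nodes))}

remote = ["Node-2", "Node-3", "Node-4", "Node-5"]

# One FTE link connected: every remote Node is reached over that single uplink.
print(assign_next_hops(["fte-1"], remote))
# {'Node-2': 'fte-1', 'Node-3': 'fte-1', 'Node-4': 'fte-1', 'Node-5': 'fte-1'}

# A second FTE link comes up: the table is reprogrammed so that some
# destinations use fte-1 and the others use fte-2.
print(assign_next_hops(["fte-1", "fte-2"], remote))
# {'Node-2': 'fte-1', 'Node-3': 'fte-2', 'Node-4': 'fte-1', 'Node-5': 'fte-2'}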

With the help of both VCCPD and VCCPDf, the QFabric’s Director devices are able to get information about:

• How many devices, and which ones (Nodes and Interconnects), are part of the QFabric system (VCCPD).

• How the Nodes are connected to the Interconnects (VCCPDf).

At this point in time, the QFabric system becomes ready to start forwarding traffic.

Now let’s take a look at how VCCPD and VCCPDf become relevant when it comes to a real life QFabric solution. Consider this sequence of steps:


1. The only connections present are the DG0-DG1 connections and the connections between the Director devices and the EX Series VC.

1.1. Note that DG0 and DG1 would assign IP addresses of 1.1.1.1 and 1.1.1.2 respectively to their bond0 links. This is the link over which the Director devices sync up with each other.

Figure 1.1 Only the Control Plane Connections are Up

1.2. The Fabric Manager VM running on the DGs would run VCCPD and the DGs will send VCCPD Hellos on their links to the EX Series VC. Note that there would be no VCCPD neighbors at this point in time as Nodes and Interconnects are yet to be connected. Also, the Control plane switches (EX Series VC) do not participate in the VCCPD adjacencies. Their function is only to provide a Layer 2 segment for all the components to communicate with each other.


Figure 1.2 Two Interconnects are Added to the CPE Network

2. In Figure 1.2 two Interconnects (IC-1 and IC-2) are connected to the EX Series VC.

2.1. The Interconnects start running VCCPD on the link connected to the EX Series VC. The EX Series VC acts as a Layer 2 switch and only floods the VCCPD packets.

2.2. The Fabric Manager VMs and the Interconnects see each other’s VCCPD Hellos and become neighbors. At this point in time, the DGs know that IC-1 and IC-2 are a part of the QFabric system.


Figure 1.3 Two Nodes are Connected to the CPE Network

3. In Figure 1.3 two new Node devices (Node-1 and Node-2) are connected to the EX Series VC.

3.1. The Nodes start running VCCPD on the links connected to the EX Series VC. Now the Fabric Manager VMs know that there are four devices in the QFabric inventory: IC-1, IC-2, Node-1, and Node-2.

3.2. Note that none of the FTE interfaces of the Nodes are up yet. This means that there is no way for the Nodes to forward traffic (there is no data plane connectivity). Whenever such a condition occurs, Junos disables all the 10GbE interfaces on the Node devices. This is a security measure to make sure that a user cannot connect a production server to a Node device that doesn’t have any active FTE ports. This also makes troubleshooting very easy. If all the 10GbE ports of a Node device go down even when devices are connected to it, the first place to check should be the status of the FTE links. If none of the FTE links are in the up/up state, then all the 10GbE interfaces will be disabled. In addition to bringing down all the 10GbE ports, the QFabric system also raises a major system alarm. The alarms can be checked using the show system alarms CLI command.


Figure 1.4 Node-1 and Node-2 are Connected to IC-1 and IC-2, Respectively

4. In Figure 1.4, the following FTE links are connected:

4.1 Node-1 to IC-1.

4.2 Node-2 to IC-2.

4.3 The Nodes and the Interconnects will run VCCPDf on the FTE links and see each other.

4.4 This VCCPDf information is fed to the Director devices. At this point in time, the Directors know that:

• There are four devices in the QFabric system. This was established at point 3.1.

• Node-1 is connected to IC-1 and Node-2 is connected to IC-2.

5. Note that some of the data plane of the QFabric is connected, but there would be no connectivity for hosts across Node devices. This is because Node-1 has no way to reach Node-2 via the data plane and vice-versa, as the Interconnects are never connected to each other. The only interfaces for the internal data plane of the QFabric system are the 40GbE FTE interfaces. In this particular example, Node-1 is connected to IC-1, but IC-1 is not connected to Node-2. Similarly, Node-2 is connected to IC-2, but IC-2 is not connected to Node-1. Hence, hosts connected behind Node-1 have no way of reaching hosts connected behind Node-2, and vice-versa.


Figure 1.5 Node-1 is Connected to IC-2

6. In Figure 1.5 Node-1 is connected to IC-2. At this point, the Fabric Manager has the following information:

6.1 There are four devices in the QFabric system.

6.2 IC-1 is connected to Node-1.

6.3 IC-2 is connected to both Node-1 as well as Node-2.

• The Fabric Manager VM running inside the Director devices realizes that Node-1 and Node-2 now have mutual reachability via IC-2.

• FM programs the internal forwarding table of Node-1. Now Node-1 knows that to reach Node-2, it needs to have the next hop of IC-2.

• FM programs the internal forwarding table of Node-2. Now Node-2 knows that to reach Node-1, it needs to have the next hop of IC-2.

6.4 At this point in time, hosts connected behind Node-1 should be able to communicate with hosts connected behind Node-2 (provided that the basic laws of networking like VLAN, routing, etc. are obeyed).


Figure 1.6 Node-2 is Connected to IC-1, which Completes the Data Plane of the QFabric System

7. In Figure 1.6 Node-2 is connected to IC-1.

7.1 The Nodes and IC-1 discover each other using VCCPDf and send this information to Fabric Manager VM running on the Directors.

7.2 Now the FM realizes that Node-1 can reach Node-2 via IC-1, also.

7.3 After the FM finishes programming the tables of Node-1 and Node-2, both Node devices will have two next hops to reach each other. These two next hops can be used for load-balancing purposes. This is where the QFabric solution provides excellent High Availability and also effective load balancing of different flows as we add more 40GbE uplinks to the Node devices.

At the end of all these steps, the internal VCCPD and VCCPDf adjacencies of the QFabric would be complete, and the Fabric Manager will have a complete topological view of the system.
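To tie the walkthrough together, here is a small Python sketch of the topology bookkeeping the steps above imply: VCCPDf link reports build a map of which Node connects to which Interconnect, and a remote Node is reachable through every Interconnect the two Nodes share. With the cabling from Figure 1.6, Node-1 and Node-2 end up with two next hops toward each other, which is what enables load balancing. The data structures and names are invented for illustration only and are not QFabric internals.

from collections import defaultdict

# VCCPDf-style link reports fed back to the Fabric Manager: (Node, Interconnect)
link_reports = [
    ("Node-1", "IC-1"),   # step 4
    ("Node-2", "IC-2"),   # step 4
    ("Node-1", "IC-2"),   # step 6
    ("Node-2", "IC-1"),   # step 7
]

topology = defaultdict(set)
for node, interconnect in link_reports:
    topology[node].add(interconnect)

def next_hops(src, dst):
    """Interconnects usable as next hops from src to dst (shared ICs only)."""
    return sorted(topology[src] & topology[dst])

print(next_hops("Node-1", "Node-2"))   # ['IC-1', 'IC-2'] -> two next hops to balance over

# After step 4 only (before steps 6 and 7), the two Nodes shared no Interconnect,
# so this intersection would have been empty and hosts behind them could not talk.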


Test Your Knowledge

Q: Name the different physical components of a QFabric system.

• Nodes, Interconnects, Director devices, EX Series VCs.

Q: Which of these components are connected to each other?

• The EX Series VC is connected to the Nodes, Interconnects, and the Director devices.

• The Director devices are connected to each other.

• The Nodes are connected to the Interconnects.

Q: Where is the management IP address of the QFabric system configured?

• During installation, the user is prompted to enter a VIP. This VIP is used for remote management of the QFabric system.

Q: Which protocol is used for QFabric system discovery? Where does it run?

• VCCPD is used for system discovery and it runs on all the components and VMs. The adjacencies for VCCPD are established over the CPE network.

Q: Which protocol is used for QFabric data plane topology discovery? Where does it run?

• VCCPDf is used for discovering the topology on the data plane. VCCPDf runs only on the Nodes and the Interconnects, and adjacencies for VCCPDf are established on the 40GbE FTE interfaces.

Q: Which Junos process is responsible for the hardware inventory management of a system?

• Chassisd.


Chapter 2

Accessing Individual Components

Logging In to Various QFabric Components
Checking Logs at Individual Components
Enabling and Retrieving Trace Options From a Component
Extracting Core Files
Checking for Alarms
Inbuilt Scripts
Test Your Knowledge


Before this book demonstrates how to troubleshoot problems, this chapter explains how to log in to the different components of a QFabric system and how to check and retrieve logs at its different levels (physical and logical components). Details on how to configure a QFabric system and aliases for individual Nodes are documented in the QFabric Deployment Guide at www.juniper.net/documentation.

Logging In to Various QFabric Components

As discussed previously, a QFabric solution has many components, some physical (Nodes, Interconnects, Director devices, CPE, etc.) and some logical VMs. One of the really handy features of QFabric is that it allows administrators (those with appropriate privileges) to log in to these individual components, which is extremely convenient when advanced troubleshooting requires working on a specific component of the system.

The hardware inventory of any Juniper router or switch is normally checked using the show chassis hardware command. QFabric software also supports this command, but several additions have been made to it expressly for the QFabric solution; these options allow users to check the hardware details of a particular component as well. For instance:

root@Test-QFABRIC> show chassis hardware ?
Possible completions:
  <[Enter]>            Execute this command
  clei-models          Display CLEI barcode and model number for orderable FRUs
  detail               Include RAM and disk information in output
  extensive            Display ID EEPROM information
  interconnect-device  Interconnect device identifier
  models               Display serial number and model number for orderable FRUs
  node-device          Node device identifier
  |                    Pipe through a command
root@Test-QFABRIC> show chassis hardware node-device ?
Possible completions:
  <node-device>        Node device identifier
  BBAK0431             Node device
  Node0                Node device
  Node1                Node device
--SNIP--

Consider the following QFabric system:

root@Test-QFABRIC> show fabric administration inventory
Item            Identifier    Connection   Configuration
Node group
  NW-NG-0                     Connected    Configured
    Node0       P6966-C       Connected
    Node1       BBAK0431      Connected
  RSNG-1                      Connected    Configured
    Node2       P4423-C       Connected
    Node3       P1377-C       Connected
  RSNG-2                      Connected    Configured
    Node4       P6690-C       Connected
    Node5       P6972-C       Connected
Interconnect device
  IC-A9122                    Connected    Configured
    A9122/RE0                 Connected
    A9122/RE1                 Connected
  IC-IC001                    Connected    Configured
    IC001/RE0                 Connected
    IC001/RE1                 Connected
Fabric manager
  FM-0                        Connected    Configured
Fabric control
  FC-0                        Connected    Configured
  FC-1                        Connected    Configured
Diagnostic routing engine
  DRE-0                       Connected    Configured

This output shows the alias and the serial number (mentioned under the Identifier column) of every Node that is a part of the QFabric system. It also shows if the Node is a part of an SNG, Redundant-SNG, or the Network Node Group.

The rightmost column of the output shows the state of each component. Every component of the QFabric should be in the Connected state. If a component shows up as Disconnected, there is an underlying problem and troubleshooting is required to find the root cause.

As shown in Figure 2.1, this particular QFabric system has six Nodes and two Interconnects. Node-0 and Node-1 are part of the Network Node Group, Node-2 and Node-3 are part of a Redundant-SNG named RSNG-1, and Node-4 and Node-5 are part of another Redundant-SNG named RSNG-2.

Figure 2.1 Visual Representation of the QFabric System of This Section

There are two ways of accessing the individual components:

� From the Linux prompt of the Director devices

� From the QFabric CLI

Accessing Components From the DGs

All the components are assigned an IP address in the 169.254 range during the system discovery phase. Note that IP addresses in the 169.254.193.x range are allotted to Node groups and the Interconnects, and IPs in the 169.254.128.x range are allotted to Node devices and to VMs. These IP addresses are used for internal management and can be used to log in to individual components from the Director device's Linux prompt. The IP addresses of the components can be seen using the dns.dump utility, which is located under /root on the Director devices. Here is an example showing sample output from dns.dump and explaining how to log in to various components:

[root@dg0 ~]# ./dns.dump

; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> -t axfr pkg.dcbg.juniper.net @169.254.0.1
;; global options:  printcmd
pkg.dcbg.juniper.net. 600 IN SOA ns.pkg.dcbg.juniper.net. mail.pkg.dcbg.juniper.net. 104 3600 600 7200 3600
pkg.dcbg.juniper.net. 600 IN NS ns.pkg.dcbg.juniper.net.
pkg.dcbg.juniper.net. 600 IN A 169.254.0.1
pkg.dcbg.juniper.net. 600 IN MX 1 mail.pkg.dcbg.juniper.net.
dcfnode---DCF-ROOT.pkg.dcbg.juniper.net. 45 IN A 169.254.192.17    <<<<<<< DCF-Root (FM's) IP address
dcfnode---DRE-0.pkg.dcbg.juniper.net. 45 IN A 169.254.3.3
dcfnode-3b46cd08-9331-11e2-b616-00e081c53280.pkg.dcbg.juniper.net. 45 IN A 169.254.128.15
dcfnode-3d9b998a-9331-11e2-bbb2-00e081c53280.pkg.dcbg.juniper.net. 45 IN A 169.254.128.16
dcfnode-4164145c-9331-11e2-a365-00e081c53280.pkg.dcbg.juniper.net. 45 IN A 169.254.128.17
dcfnode-43b35f38-9331-11e2-99b1-00e081c53280.pkg.dcbg.juniper.net. 45 IN A 169.254.128.18
dcfnode-A9122-RE0.pkg.dcbg.juniper.net. 45 IN A 169.254.128.5
dcfnode-A9122-RE1.pkg.dcbg.juniper.net. 45 IN A 169.254.128.8
dcfnode-BBAK0431.pkg.dcbg.juniper.net. 45 IN A 169.254.128.20
dcfnode-default---FABC-INE-A9122.pkg.dcbg.juniper.net. 45 IN A 169.254.193.0
dcfnode-default---FABC-INE-IC001.pkg.dcbg.juniper.net. 45 IN A 169.254.193.1
dcfnode-default---NW-INE-0.pkg.dcbg.juniper.net. 45 IN A 169.254.192.34    <<<< NW-INE's IP address
dcfnode-default---RR-INE-0.pkg.dcbg.juniper.net. 45 IN A 169.254.192.35    <<<< FC-0's IP address
dcfnode-default---RR-INE-1.pkg.dcbg.juniper.net. 45 IN A 169.254.192.36
dcfnode-default-RSNG-1.pkg.dcbg.juniper.net. 45 IN A 169.254.193.11
dcfnode-default-RSNG-2.pkg.dcbg.juniper.net. 45 IN A 169.254.193.12
dcfnode-IC001-RE0.pkg.dcbg.juniper.net. 45 IN A 169.254.128.6    <<<< IC's IP address
dcfnode-IC001-RE1.pkg.dcbg.juniper.net. 45 IN A 169.254.128.7
dcfnode-P1377-C.pkg.dcbg.juniper.net. 45 IN A 169.254.128.21
dcfnode-P4423-C.pkg.dcbg.juniper.net. 45 IN A 169.254.128.19
dcfnode-P6690-C.pkg.dcbg.juniper.net. 45 IN A 169.254.128.22
dcfnode-P6966-C.pkg.dcbg.juniper.net. 45 IN A 169.254.128.24    <<<<< node's IP address
dcfnode-P6972-C.pkg.dcbg.juniper.net. 45 IN A 169.254.128.23    <<<<< node's IP address
mail.pkg.dcbg.juniper.net. 600 IN A 169.254.0.1
ns.pkg.dcbg.juniper.net. 600 IN A 169.254.0.1
server.pkg.dcbg.juniper.net. 600 IN A 169.254.0.1
--snip--

These 169.254.x.x IP addresses (marked with arrows in the output above) can be used to log in to individual components from the Director devices:

1. Log in to the Network Node Group VM. As seen in the preceding output, the IP address for the Network Node Group VM is 169.254.192.34. The following CLI snippet shows a login attempt to this IP address:

[root@dg0 ~]# ssh root@169.254.192.34
The authenticity of host '169.254.192.34 (169.254.192.34)' can't be established.
RSA key fingerprint is 49:f0:9b:a0:bb:36:56:87:dd:c5:c5:21:2c:a6:71:e3.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '169.254.192.34' (RSA) to the list of known hosts.
root@169.254.192.34's password:
--- JUNOS 12.2X50-D41.1 built 2013-03-22 21:44:05 UTC
root@NW-NG-0% cli


{master}
root@NW-NG-0> exit

root@NW-NG-0% exit
logout
Connection to 169.254.192.34 closed.

2. Log in to FM. The IP address for Fabric Manager VM is 169.254.192.17.

[root@dg0 ~]# ssh root@169.254.192.17
The authenticity of host '169.254.192.17 (169.254.192.17)' can't be established.
RSA key fingerprint is 49:f0:9b:a0:bb:36:56:87:dd:c5:c5:21:2c:a6:71:e3.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '169.254.192.17' (RSA) to the list of known hosts.
root@169.254.192.17's password:
--- JUNOS 12.2X50-D41.1 built 2013-03-22 21:44:05 UTC
root@FM-0%
root@FM-0%

Similarly, the corresponding IP addresses mentioned in the output of dns.dump can be used to log in to other components like the Fabric Control VM or the Nodes or the Interconnects.

NOTE The Node devices with serial numbers P1377-C and P4423-C are a part of the Node group named RSNG-1. This information is present in the output of show fabric administration inventory shown above.

As mentioned previously, the RSNG abstraction works on the concept of a Virtual Chassis. Here is a CLI snippet showing the result of a login attempt to the IP address allocated to RSNG-1:

[root@dg0 ~]# ./dns.dump | grep RSNG-1
dcfnode-default-RSNG-1.pkg.dcbg.juniper.net. 45 IN A 169.254.193.11
dcfnode-default-RSNG-1.pkg.dcbg.juniper.net. 45 IN A 169.254.193.11
[root@dg0 ~]# ssh root@169.254.193.11
root@169.254.193.11's password:
--- JUNOS 12.2X50-D41.1 built 2013-03-22 21:43:51 UTC

root@RSNG-1%
root@RSNG-1%
root@RSNG-1% cli
{master} <<<<<<<<< 'Master' prompt. Chapter 3 discusses more about master/backup REs within various Node groups
root@RSNG-1> show virtual-chassis

Preprovisioned Virtual Chassis
Virtual Chassis ID: 0000.010b.0000
                                           Mstr
Member ID  Status   Model    prio  Role     Serial No
0 (FPC 0)  Prsnt    qfx3500  128   Master*  P4423-C
1 (FPC 1)  Prsnt    qfx3500  128   Backup   P1377-C

{master}
root@RSNG-1>

And here is a login to the RSNG master (P4423-C):

[root@dg0 ~]# ./dns.dump | grep P4423
dcfnode-P4423-C.pkg.dcbg.juniper.net. 45 IN A 169.254.128.19
dcfnode-P4423-C.pkg.dcbg.juniper.net. 45 IN A 169.254.128.19


[root@dg0 ~]# ssh root@169.254.128.19
The authenticity of host '169.254.128.19 (169.254.128.19)' can't be established.
RSA key fingerprint is 9e:aa:da:bb:8d:e4:1b:74:0e:57:af:84:80:c3:a8:9d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '169.254.128.19' (RSA) to the list of known hosts.
root@169.254.128.19's password:
--- JUNOS 12.2X50-D41.1 built 2013-03-22 21:43:51 UTC
root@RSNG-1%
root@RSNG-1% cli
{master} <<<<<<<< RSNG-master prompt

3. Log in to RSNG-backup. With the above-mentioned information, it’s clear that logging in to the RSNG-backup will take users to the RSNG-backup prompt. Captures from the device are shown here:

[root@dg0 ~]# ./dns.dump | grep P1377
dcfnode-P1377-C.pkg.dcbg.juniper.net. 45 IN A 169.254.128.21
dcfnode-P1377-C.pkg.dcbg.juniper.net. 45 IN A 169.254.128.21
[root@dg0 ~]# ssh root@169.254.128.21
root@169.254.128.21's password:
--- JUNOS 12.2X50-D41.1 built 2013-03-22 21:43:51 UTC
root@RSNG-1-backup%

This is expected behavior as an RSNG works on the concept of Juniper’s Virtual Chassis technology.

If any component-level troubleshooting needs to be done for an RSNG, the user must log in to the RSNG master Node. This is because the Routing Engine of the RSNG master Node is active at all times, so logs collected from this Node are the ones relevant to the Redundant SNG abstraction.

4. Log in to the line cards of the NW-NG-0 VM. As discussed earlier, the RE functionality for the NW-NG-0 is located as redundant Virtual Machines on the DGs. This means that the REs on the line cards are deactivated. Consider the following output from the Network Node Group VM on this QFabric system:

[root@dg0 ~]# ./dns.dump | grep NW-IN
dcfnode-default---NW-INE-0.pkg.dcbg.juniper.net. 45 IN A 169.254.192.34
dcfnode-default---NW-INE-0.pkg.dcbg.juniper.net. 45 IN A 169.254.192.34
[root@dg0 ~]#
[root@dg0 ~]# ssh root@169.254.192.34
--- JUNOS 12.2X50-D41.1 built 2013-03-22 21:44:05 UTC
root@NW-NG-0% cli
root@NW-NG-0> show virtual-chassis

Preprovisioned Virtual Chassis
Virtual Chassis ID: 0000.0022.0000
                                           Mstr
Member ID  Status   Model    prio  Role      Serial No
0 (FPC 0)  Prsnt    qfx3500  0     Linecard  P6966-C
1 (FPC 1)  Prsnt    qfx3500  0     Linecard  BBAK0431
8 (FPC 8)  Prsnt    fx-jvre  128   Backup    3b46cd08-9331-11e2-b616-00e081c53280
9 (FPC 9)  Prsnt    fx-jvre  128   Master*   3d9b998a-9331-11e2-bbb2-00e081c53280
{master}

Nodes P6966-C and BBAK0431 are the line cards of this NW-NG-0 VM. Since the REs of these Node devices are not active at all, there is no configuration that is pushed down to the line cards. Here are the snippets from the login prompt of the member Nodes of the Network Node Group:

[root@dg0 ~]# ./dns.dump | grep P6966
dcfnode-P6966-C.pkg.dcbg.juniper.net. 45 IN A 169.254.128.24
dcfnode-P6966-C.pkg.dcbg.juniper.net. 45 IN A 169.254.128.24


[root@dg0 ~]# ssh root@169.254.128.24
The authenticity of host '169.254.128.24 (169.254.128.24)' can't be established.
RSA key fingerprint is f6:64:18:f5:9d:8d:29:e7:95:c0:d7:4f:00:a7:3d:30.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '169.254.128.24' (RSA) to the list of known hosts.
root@169.254.128.24's password:
Permission denied, please try again.

Note that since no configuration is pushed down to the line cards in the case of NW-NG-0, a user can't log in to the line cards (the credentials won't work because the configuration is never pushed to them). You need to connect to the line cards over their console if you intend to check details directly on the NW-NG line cards. Also, note that the logs from the line cards are reflected in the /var/log/messages file on the NW-NG-0 VM. Another method of logging in to the Nodes belonging to the NW-NG is to telnet to them; however, no configuration is visible on these Nodes because their Routing Engines are disabled and hence no configuration is pushed to them.

5. From the CLI. Logging in to individual components requires user-level privileges that allow such logins; the remote-debug-permission CLI setting needs to be configured for this. Here is the configuration used on the QFabric system discussed in this chapter:

root@Test-QFABRIC> show configuration system
host-name Test-QFABRIC;
authentication-order [ radius password ];
root-authentication {
    encrypted-password "$1$LHY6NN4P$cnOMoqUj4OXKMaHOm2s.Z."; ## SECRET-DATA
    remote-debug-permission qfabric-admin;
}

There are three permissions that you can set at this hierarchy:

� qfabric-admin: Permits a user to log in to individual QFabric switch components, issue show commands, and change component configurations.

� qfabric-operator: Permits a user to log in to individual QFabric switch components and issue show commands.

� qfabric-user: Prevents a user from logging in to individual QFabric switch components.

Also, note that a user needs to have admin control privileges to add this statement to the device’s configuration.

MORE? Complete details on QFabric's system login classes can be found at this link: http://www.juniper.net/techpubs/en_US/junos13.1/topics/concept/access-login-class-qfabric-overview.html.
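In set format, the stanza shown above boils down to a single statement (a sketch; pick the permission level that fits your policy):

root@Test-QFABRIC> configure
root@Test-QFABRIC# set system root-authentication remote-debug-permission qfabric-admin
root@Test-QFABRIC# commit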

Once a user has the required remote debug permission set, they can access the individual components using the request component login command:

root@Test-QFABRIC> request component login ?
Possible completions:
  <node-name>          Inventory name for the remote node
  A9122/RE0            Interconnect device control board
  A9122/RE1            Interconnect device control board
  BBAK0431             Node device
--SNIP--


Checking Logs at Individual Components

On any device running Junos, you can find the logs under the /var/log directory. The same holds true for a QFabric system, but since a QFabric system has multiple components you have different points from where the logs can be collected:

� SNG Nodes

� RSNG Nodes

� NW-NG VM

� Central SFC (QFabric CLI prompt)

� DG’s Linux prompt

The log collection for the SNG, Redundant SNG, and Network Node Group abstractions is straightforward. All the logs pertaining to these Node groups are saved locally (on the device/VM/active-RE's filesystem) under the /var/log directory. Note that this location on the NW-INE line cards will not yield any useful information, as the RE on the line cards is not active. Also, in accordance with standard Junos behavior, the usual show log <filename> command is valid from the active REs of these Node groups.

The logs collected in the /var/log/messages file on the DGs are a collection of all the logs from all the components. In addition, this file also records important logs that are relevant to the healthy functioning of the DGs (mgd, mysql-database, DG-sync related messages, etc.). Since Junos 13.1, you can issue show log messages node-device <name> to check the log messages from a specific component. In older releases, you need to make use of the match keyword to see all the logs that have the name of that particular Node device as part of the message. So from older releases:

root@Test-QFABRIC> show log messages | match RSNG-1 | last 10
Apr 08 01:58:40  Test-QFABRIC: QFABRIC_INTERNAL_SYSLOG: RSNG-1-backup: - last message repeated 3 times
Apr 08 01:58:44  Test-QFABRIC chassism[1446]: QFABRIC_INTERNAL_SYSLOG: RSNG-1-backup: - Fan 2 is NOT spinning correctly

And from 13.1 onwards, you can check the logs for a specific component from the QFabric CLI:

root@qfabric> show log messages ?
Possible completions:
  <[Enter]>              Execute this command
  <component>
  director-device        Show logs from a director device
  infrastructure-device  Show logs from a infrastructure device
  interconnect-device    Show logs from a interconnect device
  node-device            Show logs from a node device
  |                      Pipe through a command
root@qfabric>
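For example, on Junos 13.1 and later you can pull the messages for a single Node device directly from the QFabric CLI; the Node name below is just taken from the example system in this section and should be replaced with one from your own inventory:

root@qfabric> show log messages node-device BBAK1280 | last 10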

How to Log in to Different Components and Fetch Logs

Note that the Identifier column in the following output does not show the serial number of the Nodes. This is because no aliases have been assigned to the Nodes; for such a system, the serial numbers of the Nodes are used as their aliases.

Checking the logs of an SNG:


root@qfabric> show fabric administration inventory
Item            Identifier    Connection   Configuration
Node group
  BBAK1280                    Connected    Configured
    BBAK1280                  Connected
  BBAM7499                    Connected    Configured
    BBAM7499                  Connected
  BBAM7543                    Connected    Configured
    BBAM7543                  Connected
  BBAM7560                    Connected    Configured
    BBAM7560                  Connected
  BBAP0747                    Connected    Configured
    BBAP0747                  Connected
  BBAP0748                    Connected    Configured
    BBAP0748                  Connected
  BBAP0750                    Connected    Configured
    BBAP0750                  Connected
  BBPA0737                    Connected    Configured
    BBPA0737                  Connected
  NW-NG-0                     Connected    Configured
    BBAK6318                  Connected
    BBAM7508                  Connected
  P1602-C                     Connected    Configured
    P1602-C                   Connected
  P2129-C                     Connected    Configured
    P2129-C                   Connected
  P3447-C                     Connected    Configured
    P3447-C                   Connected
  P4864-C                     Connected    Configured
    P4864-C                   Connected
Interconnect device
  IC-BBAK7828                 Connected    Configured
    BBAK7828/RE0              Connected
  IC-BBAK7840                 Connected    Configured
    BBAK7840/RE0              Connected
  IC-BBAK7843                 Connected    Configured
    BBAK7843/RE0              Connected
Fabric manager
  FM-0                        Connected    Configured
Fabric control
  FC-0                        Connected    Configured
  FC-1                        Connected    Configured
Diagnostic routing engine
  DRE-0                       Connected    Configured
root@qfabric>

[root@dg0 ~]# ./dns.dump | grep BBAK1280
dcfnode-BBAK1280.pkg.dcbg.juniper.net. 45 IN A 169.254.128.7
dcfnode-default---BBAK1280.pkg.dcbg.juniper.net. 45 IN A 169.254.193.2
dcfnode-BBAK1280.pkg.dcbg.juniper.net. 45 IN A 169.254.128.7
dcfnode-default---BBAK1280.pkg.dcbg.juniper.net. 45 IN A 169.254.193.2
[root@dg0 ~]# ssh root@169.254.128.7
The authenticity of host '169.254.128.7 (169.254.128.7)' can't be established.
RSA key fingerprint is a3:3e:2f:65:9d:93:8f:e3:eb:83:08:c3:01:dc:b9:c1.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '169.254.128.7' (RSA) to the list of known hosts.
Password:
--- JUNOS 13.1I20130306_1309_dc-builder built 2013-03-06 14:56:57 UTC
root@BBAK1280%

After logging in to a component, the logs can be viewed either by using the show log <filename> Junos CLI command or by logging in to the shell mode and checking out the contents of the /var/log directory:


root@BBAK1280> show log messages?
Possible completions:
  <filename>           Name of log file
  messages             Size: 185324, Last changed: Jun 02 20:59:46
  messages.0.gz        Size: 4978, Last changed: Jun 01 00:45:00
--SNIP--
root@BBAK1280> show log chassisd?
Possible completions:
  <filename>           Name of log file
  chassisd             Size: 606537, Last changed: Jun 02 20:41:04
  chassisd.0.gz        Size: 97115, Last changed: May 18 05:36:24
root@BBAK1280>
root@BBAK1280> exit

root@BBAK1280% cd /var/log
root@BBAK1280% ls -lrt | grep messages
-rw-rw----  1 root  wheel  4630 May 11 13:45 messages.9.gz
-rw-rw----  1 root  wheel  4472 May 13 21:45 messages.8.gz
--SNIP--
root@BBAK1280%

This is true for other components as well (Nodes, Interconnects, and VMs). However, the rule of checking logs only at the active RE of a component still applies.

Checking the Logs at the Director Devices

Checking the logs at the Director devices is quite interesting because the Director devices are the brains of QFabric and they run a lot of processes/services that are critical to the health of the system. Some of the most important logs on the Director devices can be found here:

� /var/log

� /tmp

� /vmm

/var/log

As with any other Junos platform, /var/log is a very important location as far as log collection is concerned. In particular, the /var/log/messages file records the general logs for the DG devices.

[root@dg0 tmp]# cd /var/log
[root@dg0 log]# ls
add_device_dg0.log  cron.4.gz  messages       secure.3.gz
anaconda.log        cups       messages.1.gz  secure.4.gz
--SNIP--

/tmp

This location contains all the logs pertaining to configuration push events (within the subdirectory named sfc-captures), as well as core files. Whenever a configuration is committed on the QFabric CLI, it is pushed to the various components, and the logs pertaining to those pushes can be found here:

[root@dg0 sfc-captures]# cd /tmp
[root@dg0 tmp]# ls
1296.sfcauth   26137.sfcauth   32682.sfcauth   corefiles
--SNIP--
[root@dg0 tmp]# cd sfc-captures/


[root@dg0 sfc-captures]# ls
0317  0323  0329  0335  0341  0347  0353  0359  0365  0371  0377
0318  0324  0330  0336  0342  0348  0354  0360  0366  0372  0378
0319  0325  0331  0337  0343  0349  0355  0361  0367  0373  last.txt
0320  0326  0332  0338  0344  0350  0356  0362  0368  0374  misc
0321  0327  0333  0339  0345  0351  0357  0363  0369  0375  sfc-database
0322  0328  0334  0340  0346  0352  0358  0364  0370  0376

Enabling and Retrieving Trace Options From a Component

After logging in to a specific component, you can use show commands from the operational mode. However, access to configuration mode is not allowed:

root@Test-QFABRIC> request component login NW-NG-0
Warning: Permanently added 'dcfnode-default---nw-ine-0,169.254.192.34' (RSA) to the list of known hosts.
--- JUNOS 12.2X50-D41.1 built 2013-03-22 21:44:05 UTC
{master}
qfabric-admin@NW-NG-0>

qfabric-admin@NW-NG-0> conf    <<<< configure-mode inaccessible
                       ^
unknown command.
{master}
qfabric-admin@NW-NG-0> edit
                       ^
unknown command.
{master}

A big part of troubleshooting any networking issue is enabling trace options and analyzing the logs. To enable trace options on a specific component, the user needs to have superuser access. Here is how that can be done:

qfabric-admin@NW-NG-0> start shell
% su
Password:
root@NW-NG-0% cli
{master}
qfabric-admin@NW-NG-0> configure
Entering configuration mode

{master}[edit]
qfabric-admin@NW-NG-0# set protocols ospf traceoptions flag all

{master}[edit]
qfabric-admin@NW-NG-0# commit    <<<< commit at component-level
commit complete

{master}[edit]

NOTE If any trace options are enabled at a component level, and a commit is done from the QFabric CLI, then the trace options configured at the component will be removed.

MORE? There is another method of enabling trace options on QFabric and it is documented at the following KB article: http://kb.juniper.net/InfoCenter/index?page=content&id=KB21653.

Whenever trace options are configured at a component level, the corresponding file containing the logs is saved on the file system of the active RE for that component.


Note that there is no way that an external device connected to QFabric can connect to the individual components of a QFabric system.

Because the individual components can be reached only from the Director devices, and external devices (say, an SNMP server) can reach only the DGs, you need to follow this procedure to retrieve any files that are located on the file system of a component:

1. Save the file from the component on to the DG.

2. Save the file from the DG to the external server/device.

This is because the management of the whole QFabric system is done using the VIP that is allotted to the DGs. Since QFabric is made up of a lot of physical components, always consider a QFabric system as a network of different devices. These different components are connected to each other on a common LAN segment, which is the control plane Ethernet segment. In addition to this, all the components have an internal management IP address in the 169.254 IP address range. These IP addresses can be used to copy files between different components.

Here is an example of how to retrieve log files from a component (NW-NG in this case):

root@NW-NG-0% ls -lrt /var/log | grep ospf
-rw-r-----  1 root  wheel  59401 Apr  8 08:26 ospf-traces    <<<< the log file is saved at /var/log on the NW-INE VM
root@NW-NG-0% exit
logout
Connection to 169.254.192.34 closed.
[root@dg0 ~]# ./dns.dump | grep NW-INE
dcf-default---NW-INE-0.pkg.dcbg.juniper.net. 45 IN A 169.254.192.34
dcf-default---NW-INE-0.pkg.dcbg.juniper.net. 45 IN A 169.254.192.34
[root@dg0 ~]#
[root@dg0 ~]# ls -lrt | grep ospf
[root@dg0 ~]#
[root@dg0 ~]# scp root@169.254.192.34://var/log/ospf-traces .
root@169.254.192.34's password:
ospf-traces                                   100%   59KB  59.0KB/s   00:00
[root@dg0 ~]# ls -lrt | grep ospf
-rw-r-----  1 root  root  60405 Apr  8 01:27 ospf-traces

Here, you've successfully transferred the log file to the DG. Since the DGs have management access to the gateway, you can now transfer this file out of the QFabric system to the required location.
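The second step is a plain scp from the Director device's Linux prompt to whatever external host you use for collection. A minimal sketch, where the destination host 192.0.2.50, the user admin, and the target path are hypothetical placeholders:

[root@dg0 ~]# scp /root/ospf-traces admin@192.0.2.50:/var/tmp/    <<<< hypothetical external server and path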

Extracting Core Files

There might be a situation in which one of the processes of a component writes a core file onto the file system. The core files are also saved at a specific location on the Director devices. Consider the following output:

root@TEST-QFABRIC> show system core-dumps
Repository scope: shared
Repository head: /pbdata/export
List of nodes for core repository: /pbdata/export/rdumps/    <<<< All the cores are saved here

Just like trace options, core files are also saved locally on the file system of the components. These files can be retrieved the same way as trace option files are retrieved:


� First, save the core file from the component onto the DG.

� Once the file is available on the DG it can be accessed via other devices that have IP connectivity to the Director devices.
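As a sketch, the same two-step copy applies to cores. The repository path below comes from the show system core-dumps output above; the external host and path are hypothetical placeholders, and the core file name will obviously differ on your system:

[root@dg0 ~]# ls -lrt /pbdata/export/rdumps/                                      <<<< cores collected from the components
[root@dg0 ~]# scp /pbdata/export/rdumps/<core-file> admin@192.0.2.50:/var/tmp/    <<<< hypothetical external server and path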

Checking for Alarms

QFabric components have LEDs that can show or blink red or amber in case there is an alarm.

In addition to this, the administrator can check the active alarms on a QFabric system by executing the show chassis alarms command from the CLI, and the output shows the status of alarms for all the components of a QFabric system.

Since QFabric has many different Nodes and IC devices, there are additional CLI extensions to the show chassis alarms command that let you check the alarms related to a specific Node or IC, as shown in this Help output:

root@qfabric> show chassis alarms ?
Possible completions:
  <[Enter]>            Execute this command
  interconnect-device  Interconnect device identifier
  node-device          Node device identifier
  |                    Pipe through a command
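For instance, to narrow the output down to a single Node device, something along the following lines can be used (the Node name is taken from the earlier inventory example and should be replaced with one from your own system; output omitted):

root@qfabric> show chassis alarms node-device BBAK1280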

Inbuilt Scripts

There are several inbuilt scripts in the QFabric system that can be run to check the health of or gather additional information about the system. These scripts are present in the /root directory of the DGs. Most of the inbuilt scripts are leveraged by the system in the background (to do various health checks on the QFabric system). The names of the scripts are very intuitive and here are a few that can be extremely useful:

� dns.dump: Shows the IP addresses corresponding to all the components (it’s already been used multiple times in this book).

� createpblogs: This script gathers the logs from all the components and stores them as /tmp/pblogs.tgz. From Junos 12.3 and up, this log file is saved at the /pbdata/export/rlogs/ location. This script is extremely useful when troubleshooting QFabric. Best practice suggests running this script before and after every major change that is done on the QFabric system; that way you'll know how the logs looked before and after the change, which is useful for both JTAC and yourself when it comes time to troubleshoot issues (see the sketch after this list).

� pingtest.sh: This script pings all the components of the QFabric system and reports their status. If any of the Nodes are not reachable, then a suitable status is shown for that Node. Here is what a sample output would look like:

[root@dg1 ~]# ./pingtest.sh
----> Detected new host dcfnode---DCF-ROOT
dcfnode---DCF-ROOT - ok
----> Detected new host dcfnode---DRE-0
dcfnode---DRE-0 - ok
----> Detected new host dcfnode-13daf6fc-9b6c-11e2-bafc-00e081ce1e76
dcfnode-13daf6fc-9b6c-11e2-bafc-00e081ce1e76 - ok
----> Detected new host dcfnode-150d8a4e-9b6c-11e2-a1ae-00e081ce1e76
dcfnode-150d8a4e-9b6c-11e2-a1ae-00e081ce1e76 - ok
----> Detected new host dcfnode-16405946-9b6c-11e2-a345-00e081ce1e76
dcfnode-16405946-9b6c-11e2-a345-00e081ce1e76 - ok


----> Detected new host dcfnode-17732b54-9b6c-11e2-a937-00e081ce1e76
dcfnode-17732b54-9b6c-11e2-a937-00e081ce1e76 - ok
----> Detected new host dcfnode-226b5716-9b80-11e2-aea7-00e081ce1e76
--snip--

� dcf_sfc_show_versions: Shows the software version (revision number) running on the SFC component. Also shows the versions of various daemons running on the system.
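Here is the createpblogs sketch referred to above. Run it from /root on the Director device before and after a change and keep the resulting archive; treat the listing below as an outline rather than exact output (the /pbdata/export/rlogs/ location applies to Junos 12.3 and later, /tmp/pblogs.tgz to older releases, as noted above):

[root@dg0 ~]# cd /root
[root@dg0 ~]# ./createpblogs                       <<<< gathers logs from all the components
[root@dg0 ~]# ls -lrt /pbdata/export/rlogs/        <<<< archive location on Junos 12.3 and later
[root@dg0 ~]# ls -lrt /tmp/pblogs.tgz              <<<< archive location on older releases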

CAUTION Certain scripts can cause some traffic disruption and hence should never be run on a QFabric system that is carrying production traffic, for instance: format.sh, dcf_sfc_wipe_cluster.sh, reset_initial_configuration.sh.

Test Your Knowledge

Q: Which CLI command can be used to view the hardware inventory of the Nodes and Interconnects?

� The show chassis hardware command can be used to view the hardware inventory. This is a Junos command that is supported on other Juniper platforms as well. For a QFabric system, additional keywords can be used to view the hardware inventory of a specific Node or Interconnect.

Q: Which CLI command can be used to display all the individual components of a QFabric system?

� The show fabric administration inventory command lists all the components of the QFabric system and their current states.

Q: What are the two ways to log in to the individual components?

� From the Linux prompt of the Director devices.

� From the CLI using the request component login command.

Q: What is the IP address range that is allocated to Node groups and Node devices?

� Node devices: 169.254.128.x

� Node groups: 169.254.193.x

Q: What inbuilt script can be used to obtain the IP address allocated to the different components of a QFabric system?

� The dns.dump script, which is located in the /root directory on the Director devices.


Chapter 3

Control Plane and Data Plane Flows

Control Plane and Data Plane

Routing Engines

Route Propagation

Maintaining Scale

Distributing Routes to Different Nodes

Differences Between Control Plane Traffic and Internal Control Plane Traffic

Test Your Knowledge


One of the goals of this book is to help you efficiently troubleshoot an issue on a QFabric system. To achieve this, it’s important to understand exactly how the internal QFabric protocols operate and the packet flow of both the data plane and control plane traffic.

Control Plane and Data Plane

Juniper’s routing and switching platforms, like the MX Series and the EX Series, all implement the concept of separating the data plane from the control plane. Here is a quick explanation:

� The control plane is responsible for a device’s interaction with other devices and for running various protocols. The control plane of a device resides on the CPU and is responsible for forming adjacencies and peerings, and for learning routes (Layer 2 or Layer 3). The control plane sends the information about these routes to the data plane.

� The data plane resides on the chip or ASIC and this is where the actual packet forwarding takes place. Once the control plane sends information about specific routes to the data plane, the forwarding tables on the ASIC are populated accordingly. The data plane takes care of functions like forwarding, QoS, filtering, packet-parsing, etc. The performance of a device is determined by the quality of its data plane (also called the Packet Forwarding Engine or PFE).

Routing Engines

This chapter discusses the path that packets take for control plane and data plane traffic. The following bulleted lists summarize the protocols that run on each of the Node group abstractions.

Server Node Group (SNG)

� As previously discussed, when a Node is connected to a QFabric system for the first time, it comes up as an SNG. It’s considered to be a Node group with only one Node.

� The SNG is designed to be connected to servers and devices that do not need cross-Node resiliency.

� The SNG doesn't run any routing protocols; it needs to run only host-facing protocols like LACP, LLDP, and ARP.

� The Routing Engine functionality is present on the local CPU. This means that MAC-addresses are learned locally for the hosts that are connected directly to the SNG.

� The local PFE has the data plane responsibilities.

� See Figure 3.1.


Figure 3.1 Server Node Group (SNG)

Redundant Server Node Group (RSNG)

� Two independent SNGs can be combined (using configuration) to become an RSNG.

� The RSNG is designed to be connected to servers/devices that need cross-Node resiliency.

� Common design: At least one NIC of a server is connected to each Node of an RSNG. These ports are bundled together as a LAG (LACP or static LAG); a configuration sketch follows Figure 3.2.

� Doesn't run any routing protocols; needs to run only host-facing protocols like LACP, LLDP, and ARP.

� The Routing Engine functionality is active/passive (only one Node has the active RE, the other stays in backup mode). This means that the MAC addresses of switches/hosts connected directly to the RSNG Nodes are learned on the active RE of the RSNG.

� The PFEs of both the Nodes are active and forward traffic at all times.

� See Figure 3.2.


Figure 3.2 Redundant Server Node Group (RSNG)
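Here is the configuration sketch referred to in the common-design bullet above: a two-member LAG whose legs terminate on the two Nodes of an RSNG. This is only a rough, hypothetical illustration (the Node group and interface names, the ae number, the device count, and the VLAN are all invented for the example); check the QFabric deployment guide for the authoritative syntax:

set chassis node-group RSNG0 aggregated-devices ethernet device-count 1
set interfaces RSNG0:ae0 aggregated-ether-options lacp active
set interfaces RSNG0:ae0 unit 0 family ethernet-switching vlan members V100
set interfaces MLRSNG01a:xe-0/0/10 ether-options 802.3ad RSNG0:ae0    <<<< one leg on each member Node of the RSNG
set interfaces MLRSNG02a:xe-0/0/10 ether-options 802.3ad RSNG0:ae0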

Network Node Group (NW-NG)

� Up to eight Nodes can be configured to be part of the NW-NG.

� The NW-NG is designed to provide connectivity to routers, firewalls, switches, etc.

� Common design: Nodes within NW-NG connect to routers, firewalls, or other important Data Center devices like load-balancers, filters, etc.

� Runs all the protocols available on RSNG/SNG. It can also run protocols like RIP, OSPF, BGP, xSTP, PIM, etc.

� The Routing Engine functionality is located on VMs that run on the DGs; these VMs are active/passive. The REs of the Nodes are disabled. This means that the MAC addresses of the devices connected directly to the NW-NG are learned on the active NW-NG-VM. Also, if the NW-NG is running any Layer 3 protocols with the connected devices (OSPF, BGP, etc.), then the routes are also learned by the active NW-NG-VM.

� The PFEs of the Nodes within an NW-NG are active at all times.

� See Figure 3.3.


Figure 3.3 Network Node Group

Route Propagation

As with any other internetworking device, the main job of QFabric is to send traffic end-to-end. To achieve this, the system needs to learn various kinds of routes (such as Layer 2 routes, Layer 3 routes, ARP, etc.).

As discussed earlier, there can be multiple active REs within a single QFabric system. Each of these REs can learn routes locally, but a big part of understanding how QFabric operates is to know how these routes are exchanged between various REs within the system.

One approach to exchanging these routes between different REs is to send all the routes learned on one RE to all the other active REs. While this is simple to do, such an implementation would be counterproductive because all the routes eventually need to be pushed down to the PFE so that hardware forwarding can take place. If you send all the routes to every RE, then the overall scale of the QFabric comes down to the table limits of a single PFE; the scale of the complete QFabric solution would be only as good as the scale of a single RE. This is undesirable, and the next section discusses how Juniper's QFabric technology maintains scale with a distributed architecture.


Maintaining Scale

One of the key advantages of the QFabric architecture is its scale. The scaling numbers for MAC addresses and IP addresses obviously depend on the number of Nodes that are part of a QFabric system, because the data plane always resides on the Nodes and the routes must be programmed in the PFE (the data plane) to ensure end-to-end traffic forwarding.

As discussed earlier, not all the routes learned on an RE are sent to every other RE. Instead, an RE receives only the routes that it needs to forward data. This poses a big question: What parameters decide whether a route should be sent to a Node's PFE or not?

The answer is: it depends on the kind of route. The deciding factor for a Layer 2 route is different from the factor for a Layer 3 route. Let's examine them briefly to understand these differences.

Layer 2 Routes

A Layer 2 route is the combination of a VLAN and a MAC address (a VLAN-MAC pair); this is the information stored in the Ethernet switching table of any Juniper EX Series switch. Layer 2 traffic can be either unicast or BUM (Broadcast, Unknown-unicast, or Multicast, all three of which are flooded within the VLAN).

Figure 3.4 is a representation of a QFabric system where Node-1 has active ports in VLANs 10 and 20, Node-2 has hosts in VLANs 20 and 30 connected to it, and Node-3 and Node-4 each have hosts in VLANs 30 and 40 connected to them. Active ports means that the Nodes either have hosts directly connected to them, or that the hosts are plugged into access switches and these switches plug into the Nodes. For the sake of simplicity, let's assume that all the Nodes shown in Figure 3.4 are SNGs, meaning that for this section the words Node and RE can be used interchangeably.

Figure 3.4 A Sample of QFabric’s Nodes, ICs, and connected Hosts


Layer 2 Unicast Traffic

Consider that Host-2 wants to send traffic to Host-3, and let's assume that the MAC address of Host-3 has already been learned. This traffic is Layer 2 unicast traffic because both the source and destination devices are in the same VLAN. When Node-1 sees this traffic coming in from Host-2, all of it should be sent over to Node-2 internally within the QFabric example in Figure 3.4. When Node-2 receives this traffic, it should be sent unicast out of the port where the host is connected. This kind of communication means:

� Host-3's MAC address is learned on Node-2. There should be some way to send this Layer 2 route's information over to Node-1. Once Node-1 has this information, it knows that everything destined to Host-3's MAC address should be sent to Node-2 over the data plane of the QFabric.

� This is true for any other host in VLAN-20 that is connected on any other Node.

� Note that if Host-5 wishes to send traffic to Host-3, then this traffic must be routed at Layer 3, as these hosts are in different VLANs. The regular laws of networking apply in this case, and Host-5 would need to resolve the ARP for its gateway. The same concept applies if Host-6 wishes to send data to Host-3. Since none of the hosts behind Node-3 ever need to resolve the MAC address of Host-3 to be able to send data to it, there is no need for Node-2 to advertise Host-3's MAC address to Node-3. However, this would change if a new host in VLAN-20 were connected behind Node-3.

In conclusion, if a Node learns a MAC address in a specific VLAN, then this MAC address should be sent over to all the other Nodes that have an active port in that particular VLAN. Note that this communication of letting other Nodes know about a certain MAC address is part of the internal control plane traffic within the QFabric system; this data is not sent out to devices that are connected to the Nodes of the QFabric system. Hence, for Layer 2 routes, the factor that decides whether a Node gets a route or not is the VLAN.

Layer 2 BUM Traffic

Let’s consider that Host-4 sends out Layer 2 broadcast traffic, that is, frames in which the destination MAC address is ff:ff:ff:ff:ff:ff and that all this traffic should be flooded in VLAN-30. In the QFabric system depicted in Figure 3.4, there are three Nodes that have active ports in VLAN-30: Node-2, Node-3, and Node-4. What happens?

� All the broadcast traffic originated by the Host-4 should be sent internally to Node-3 and Node-4 and then these Nodes should be able to flood this traffic in VLAN-30.

� Since Node-1 doesn’t have any active ports in VLAN-30, it doesn’t need to flood this traffic out of any revenue ports or server facing ports. This means that Node-2 should not send this traffic over to Node-1. However, at a later time, if Node-1 gets an active port in VLAN-30 then the broadcast traffic will be sent to Node-1 as well.

� These points are true for BUM traffic assuming that IGMP snooping is disabled.


In conclusion, if a Node receives BUM traffic in a VLAN, then all that traffic should be sent over to all the other Nodes that have an active port in that VLAN and not to those Nodes that do not have any active ports in this VLAN.

Layer 3 Routes

Layer 3 routes are good old unicast IPv4 routes. Note that only the NW-NG-VM has the ability to run Layer 3 protocols with externally connected devices; hence, at any given time, the active NW-NG-VM has all the Layer 3 routes learned in all the routing instances that are configured on a given QFabric system. However, not all of these routes are sent to the PFE of every Node within an NW-NG.

Let’s use Figure 3.5, which represents a Network Node Group, for the discussion of Layer 3 unicast routes. All the Nodes shown are the part of NW-NG. Host-1 is connected to Node-1, Host-2 and Host-3 are connected to Node-2, and Host-4 is connected to Node-3. You can see that all the IP addresses and the subnets are shown as well. Additionally, the subnets for Host-1 and Host-2 are in routing instance RED, whereas the subnets for Host-3 and Host-4 are in routing instance BLUE. The default gateways for these hosts are the Routed VLAN Interfaces (RVIs) that are configured and shown in the diagram in Figure 3.5.

Let’s assume that there are hosts and devices connected to all three Nodes in the default (master) routing instance, although not shown in the diagram. The case of IPv4 routes is much simpler than Layer 2 routes. Basically, it’s the routing instance that decides if a route should be sent to other REs or not.

Figure 3.5 Network Node Group and Connected Hosts


In the QFabric configuration shown in Figure 3.5, the following takes place:

� Node-1 and Node-2 have one device each connected in routing instance RED. The default gateway (interface vlan.100) for these devices resides on the active NW-NG-VM, meaning that the NW-NG-VM has two direct routes in this routing instance, one for the subnet 1.1.1.0/24 and the other for 2.2.2.0/24.

� Since the route propagation deciding factor for Layer 3 routes is the routing instance, the active NW-NG-VM sends the routes for 1.1.1.0/24 and 2.2.2.0/24 to both Node-1 and Node-2 so that these routes can be programmed in the data plane (the PFE of the Nodes).

� The active NW-NG-VM will not send the information about the directly connected routes in routing instance BLUE over to Node-1 at all. This is because Node-1 doesn’t have any directly connected devices in the BLUE routing instance.

� This is true for all kinds of routes learned within a routing instance; they could either be directly connected, static, or learned via routing protocols like BGP, OSPF, or IS-IS.

� All of the above applies to Node-2 and Node-3 for the routing instance named BLUE.

� All of the above applies to SNGs, RSNGs, and the master routing instance.

In conclusion, route learning always takes place at the active NW-NG-VM and only selective routes are propagated to the individual Nodes for programming the data plane (the PFE of the Nodes). An individual Node gets a Layer 3 route from the active NW-NG-VM only if the Node has an active port in that routing instance. This concept of sending routes to an RE/PFE only if it needs them ensures that you do not send all the routes everywhere, and that is what lets a QFabric system operate at high scale. Now let's discuss how those routes are sent over to the different Nodes.
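As a rough configuration sketch of the RED side of Figure 3.5 (the gateway address, the names, and the instance type are illustrative assumptions rather than the book's configuration), the RVI and routing instance could look something like this:

set vlans V100 vlan-id 100 l3-interface vlan.100
set interfaces vlan unit 100 family inet address 1.1.1.254/24    <<<< assumed gateway address for the 1.1.1.0/24 subnet
set routing-instances RED instance-type virtual-router
set routing-instances RED interface vlan.100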

Distributing Routes to Different Nodes

The QFabric technology uses the concept of Layer 3 MPLS-VPNs (RFC 2547) internally to make sure that a Node gets only the routes that it needs. RFC 2547 introduced the concepts of Route Distinguishers (RD) and Route Targets (RT).

QFabric technology uses these same concepts to make sure that an RE gets only the routes that it needs. Let's again review the different kinds of routes in Table 3.1.

Table 3.1    Layer 2 and Layer 3 Route Comparison

Layer 2 Routes                                             Layer 3 Routes
Deciding factor is the VLAN.                               Deciding factor is the routing instance.
Internally, each VLAN is assigned a token.                 Internally, each routing instance is assigned a token.
This token acts as the RD/RT and is the deciding           This token acts as the RD/RT and is the deciding
factor for whether a route should be sent to a Node.       factor for whether a route should be sent to a Node.


Each active RE within the QFabric system forms a BGP peering with the VMs named FC-0 and FC-1. All the active REs send all of their Layer 2 and Layer 3 routes over to the FC-0 and FC-1 VMs via BGP, and these VMs send only the appropriate routes over to the individual REs (only the routes that each RE needs).

The FC-0 and FC-1 VMs act as route reflectors. However, these VMs follow the rules of QFabric technology when deciding which routes are sent to which RE (they do not send the routes that an RE doesn't need).
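To see these internal peerings from a component's point of view, the standard BGP show commands can be run after logging in to the component; a minimal sketch with the output omitted (the peers you see should be the Fabric Control VMs, appearing with internal addresses such as the 128.0.128.x ones visible in the route outputs later in this chapter):

root@Test-QFABRIC> request component login RSNG0
qfabric-admin@RSNG0> show bgp summary    <<<< FC-0 and FC-1 should appear as established peers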

Figure 3.6 shows all the components (SNG, RSNG, and NW-NG VMs) sending all of their learned routes (Layer 2 and Layer 3) over to the Fabric Control VM.

Figure 3.6 Different REs Send Their Learned Routes to the Fabric Control VM

However, the Fabric Control VM sends only those routes to a component that are relevant to it. In Figure 3.7, the different colored arrows signify that the relevant routes that the Fabric Control VMs send to each component may be different.

Let’s look at some show command snippets that will demonstrate how a local route gets sent to the FC, and then how the other Nodes see it. They will be separated into Layer 2 and Layer 3 routes, and most of the snippets have bolded notes preceded by <<.


Figure 3.7 Fabric Control VM Sends Only Relevant Routes to the Individual REs

Show Command Snippets for Layer 2 Routes

In the following example, the MAC address of ac:4b:c8:f8:68:97 is learned on MLRSNG01a:xe-0/0/8.0:

root@TEST-QFABRIC# run show ethernet-switching table vlan 709    << from the QFabric CLI
Ethernet-switching table: 6 unicast entries
  VLAN    MAC address        Type    Age    Interfaces
  V709    *                  Flood   -      NW-NG-0:All-members
                                            RSNG0:All-members
  V709    00:00:5e:00:01:01  Learn   0      NW-NG-0:ae0.0
  V709    3c:94:d5:44:dd:c1  Learn   2:17   NW-NG-0:ae34.0
  V709    40:b4:f0:73:42:01  Learn   2:25   NW-NG-0:ae36.0
  V709    40:b4:f0:73:9e:81  Learn   2:06   NW-NG-0:ae38.0
  V709    ac:4b:c8:83:b7:f0  Learn   2:04   NW-NG-0:ae0.0
  V709    ac:4b:c8:f8:68:97  Learn   0      MLRSNG01a:xe-0/0/8.0
[edit]
root@TEST-QFABRIC#

Here the Node named MLRSNG01a is a member of the RSNG named RSNG0:

root@TEST-QFABRIC# run show fabric administration inventory node-groups RSNG0
Item            Identifier    Connection   Configuration
Node group
  RSNG0                       Connected    Configured
    MLRSNG01a   P6810-C       Connected
    MLRSNG02a   P7122-C       Connected
[edit]

This is what the local route looks like on the RSNG:

qfabric-admin@RSNG0> show ethernet-switching table | match ac:4b:c8:f8:68:97
  V709---qfabric   ac:4b:c8:f8:68:97  Learn   43   xe-0/0/8.0    << learned on fpc0
{master}
qfabric-admin@RSNG0> show virtual-chassis

Page 52: This Week: QFabric System Traffic Flows and Troubleshooting · 2014-05-13 · Since Data Centers host mission critical applications, redundancy is of prime impor - tance. To provide

50 ThisWeek:QFabricSystemTrafficFlowsandTroubleshooting

Preprovisioned Virtual Chassis
Virtual Chassis ID: 0000.0103.0000
                                           Mstr
Member ID  Status   Model    prio  Role     Serial No
0 (FPC 0)  Prsnt    qfx3500  128   Master*  P6810-C    << MLRSNG01a or fpc0
1 (FPC 1)  Prsnt    qfx3500  128   Backup   P7122-C
{master}
qfabric-admin@RSNG0>

Here are some more details about the RSNG:

qfabric-admin@RSNG0> show fabric summary
Autonomous System : 100
INE Id            : 128.0.130.6    << the local INE-id
INE Type          : Server
Simulation Mode   : SI
{master}

The hardware token assigned to VLAN.709 is 12:

qfabric-admin@RSNG0> start shell
% vlaninfo
Index  Name              Inst  Tag   Flags  HW-Token  L3-ifl  MST Index
2      default           0     0     0x100  3         0       254
3      V650---qfabric    0     650   0x100  8         0       254
4      V709---qfabric    0     709   0x100  12        0       254
5      V2283---qfabric   0     2283  0x100  19        0       254

The hardware token for a VLAN can also be obtained from the CLI using the following commands:

qfabric-admin@RSNG0> show vlans V709---qfabric extensive
VLAN: V709---qfabric, Created at: Thu Nov 14 05:39:28 2013
802.1Q Tag: 709, Internal index: 4, Admin State: Enabled, Origin: Static
Protocol: Port Mode, Mac aging time: 300 seconds
Number of interfaces: Tagged 0 (Active = 0), Untagged 0 (Active = 0)
{master}
qfabric-admin@RSNG0> show fabric vlan-domain-map vlan 4
Vlan  L2Domain  L3-Ifl  L3-Domain
4     12        0       0
{master}
qfabric-admin@RSNG0>

The Layer 2 domain shown in the output of show fabric vlan-domain-map vlan <internal-index> contains the same value as that of the hardware token of the VLAN and it’s also called the L2Domain-Id for a particular VLAN.

As discussed earlier, this route is sent over to the FC-VM. This is how the route looks on the FC-VM (note that the FC-VM uses a unique table called bgp.bridgevpn.0):

qfabric-admin@FC-0> show route fabric table bgp.bridgevpn.0
--snip--
65534:1:12.ac:4b:c8:f8:68:97/152
                   *[BGP/170] 6d 07:42:56, localpref 100
                      AS path: I, validation-state: unverified
                    > to 128.0.130.6 via dcfabric.0, Push 1719, Push 10, Push 25(top)
                    [BGP/170] 6d 07:42:56, localpref 100, from 128.0.128.8
                      AS path: I, validation-state: unverified
                    > to 128.0.130.6 via dcfabric.0, Push 1719, Push 10, Push 25(top)


So, the next hop for this route is shown as 128.0.130.6. It is clear from the earlier output snippets that this is the internal IP address of the RSNG. The 12 that follows the route distinguisher in the prefix (65534:1:12.ac:4b:c8:f8:68:97) is the hardware token of the VLAN; the output snippets above showed that the token for VLAN.709 is 12.

The labels shown in the route output at the FC-VM are specific to the way the FC-VM communicates with this particular RE (the RSNG). The origination and meaning of these labels are beyond the scope of this book.

As discussed earlier, a Layer 2 route should be sent across to all the Nodes that have active ports in that particular VLAN. In this specific example, here are the Nodes that have active ports in VLAN.709:

root@TEST-QFABRIC# run show vlans 709
Name  Tag  Interfaces
V709  709  MLRSNG01a:xe-0/0/8.0*, NW-NG-0:ae0.0*, NW-NG-0:ae34.0*, NW-NG-0:ae36.0*, NW-NG-0:ae38.0*
[edit]

Since the NW-NG Nodes are active for VLAN 709, the active NW-NG-VM should have the Layer 2 route under discussion (ac:4b:c8:f8:68:97 in VLAN 709) learned from the FC-VM over the internal BGP sessions. Here are the corresponding show snippets from the NW-NG-VM (note that whenever the individual REs learn Layer 2 routes from the FC, they are stored in the table named default.bridge.0):

root@TEST-QFABRIC# run request component login NW-NG-0
Warning: Permanently added 'dcfnode-default---nw-ine-0,169.254.192.34' (RSA) to the list of known hosts.
Password:
--- JUNOS 13.1I20130618_0737_dc-builder built 2013-06-18 08:51:07 UTC
At least one package installed on this device has limited support.
Run 'file show /etc/notices/unsupported.txt' for details.
{master}
qfabric-admin@NW-NG-0> show route fabric table default.bridge.0
--snip--
12.ac:4b:c8:f8:68:97/88
                   *[BGP/170] 1d 10:53:47, localpref 100, from 128.0.128.6
                      AS path: I, validation-state: unverified
                    > to 128.0.130.6 via dcfabric.0, Layer 2 Fabric Label 1719 PFE Id 10 Port Id 25
                    [BGP/170] 1d 10:53:47, localpref 100, from 128.0.128.8
                      AS path: I, validation-state: unverified
                    > to 128.0.130.6 via dcfabric.0, Layer 2 Fabric Label 1719 PFE Id 10 Port Id 25

The 12 at the beginning of the route prefix (12.ac:4b:c8:f8:68:97) is the token for VLAN.709; the destination PFE-ID and the Port-ID are data plane entities. This is the information that gets pushed down to the PFEs of the member Nodes, and these details are then used to forward data in hardware. In this example, whenever a member Node of the NW-NG gets traffic for this MAC address, it sends the data over the FTE links to the Node with a PFE-ID of 10. The PFE-IDs of all the Nodes within a Node group can be seen by logging into the corresponding VM and correlating the outputs of show fabric multicast vccpdf-adjacency and show virtual-chassis. In this example, it's the RSNG that locally learns the Layer 2 route for ac:4b:c8:f8:68:97 in VLAN 709. Here are the outputs of the commands that show which Node has the PFE-ID of 10:

root@TEST-QFABRIC# run request component login RSNG0
Warning: Permanently added 'dcfNode-default-rsng0,169.254.193.3' (RSA) to the list of known hosts.
Password:
--- JUNOS 13.1I20130618_0737_dc-builder built 2013-06-18 08:50:01 UTC
At least one package installed on this device has limited support.
Run 'file show /etc/notices/unsupported.txt' for details.
{master}
qfabric-admin@RSNG0> show fabric multicast vccpdf-adjacency
Flags: S - Stale
Src     Src     Src       Dest                            Src   Dest
Dev id  INE     Dev type  Dev id  Interface         Flags Port  Port
9       34      TOR       256     n/a                     -1    -1
9       34      TOR       512     n/a                     -1    -1
10      259(s)  TOR       256     fte-0/1/1.32768         1     3
10      259(s)  TOR       512     fte-0/1/0.32768         0     3
11      34      TOR       256     n/a                     -1    -1
11      34      TOR       512     n/a                     -1    -1
12      259(s)  TOR       256     fte-1/1/1.32768         1     2
12      259(s)  TOR       512     fte-1/1/0.32768         0     2
256     260     F2        9       n/a                     -1    -1
256     260     F2        10      n/a                     -1    -1
256     260     F2        11      n/a                     -1    -1
256     260     F2        12      n/a                     -1    -1
512     261     F2        9       n/a                     -1    -1
512     261     F2        10      n/a                     -1    -1
512     261     F2        11      n/a                     -1    -1
512     261     F2        12      n/a                     -1    -1
{master}

The Src Dev id column shows the PFE-IDs of the member Nodes, and the Interface column shows the FTE interface that goes to the Interconnects. The rows for Src Dev id 10 show that the device that is fpc0 has the PFE-ID of 10 (fte-0/1/1 means that the port belongs to the member Node that is fpc0).

The output of show virtual-chassis shows which device is fpc-0:

qfabric-admin@RSNG0> show virtual-chassis
Preprovisioned Virtual Chassis
Virtual Chassis ID: 0000.0103.0000
                                        Mstr
Member ID  Status   Model      prio     Role      Serial No
0 (FPC 0)  Prsnt    qfx3500    128      Master*   P6810-C
1 (FPC 1)  Prsnt    qfx3500    128      Backup    P7122-C
{master}

These two snippets show that the device that is fpc0 is the Node with the serial number P6810-C. Also, the MAC address was originally learned on port xe-0/0/8 (refer to the preceding outputs).

The last part of the data plane information on the NW-NG was the port-ID of the Node with PFE-ID 10. The PFE-ID generation is Juniper confidential information and beyond the scope of this book. However, when QFX3500s are used as the Nodes, the port-ID shown in the output of show route fabric table default.bridge.0 is always 17 more than the actual port number on the ingress Node. In this example, the MAC address was learned on xe-0/0/8 on the RSNG Node, so the port-ID shown on the NW-NG should be 8 + 17 = 25, which is exactly what the earlier output of show route fabric table default.bridge.0 displayed.
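The two correlations used above (FTE interface name to member fpc, and front-panel port to fabric port-ID) are simple enough to capture in a few lines. The following Python sketch is not QFabric software; it only restates the arithmetic from this section, assuming the fixed offset of 17 that applies when QFX3500s are used as Nodes and the usual fte-<fpc>/<pic>/<port> interface naming.

# Conceptual helpers that mirror the manual correlation done above.
# These are illustrative only, not QFabric APIs.

QFX3500_PORT_ID_OFFSET = 17   # fabric port-ID = front-panel port number + 17

def port_id_from_front_panel(port_number: int) -> int:
    """Map a QFX3500 front-panel port (xe-x/0/<port>) to the fabric port-ID."""
    return port_number + QFX3500_PORT_ID_OFFSET

def front_panel_from_port_id(port_id: int) -> int:
    """Reverse mapping: fabric port-ID back to the front-panel port number."""
    return port_id - QFX3500_PORT_ID_OFFSET

def fpc_from_fte_interface(ifname: str) -> int:
    """Extract the member (fpc) number from an FTE interface name such as fte-0/1/1.32768."""
    return int(ifname.split("-", 1)[1].split("/", 1)[0])

if __name__ == "__main__":
    assert port_id_from_front_panel(8) == 25                  # xe-0/0/8 -> Port Id 25 in default.bridge.0
    assert front_panel_from_port_id(21) == 4                  # Port Id 21 -> front-panel port 4 (xe-2/0/4)
    assert fpc_from_fte_interface("fte-0/1/1.32768") == 0     # the member that is fpc0
    print("all correlations check out")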


Show Command Snippets for Layer 3 Routes

Layer 3 routes are propagated in much the same way as Layer 2 routes; the only difference is that the table is named bgp.l3vpn.0. As discussed, it is the routing instance that decides whether a Layer 3 route should be sent to a given Node device. A short conceptual sketch of this distribution rule follows, and then the CLI snippets verify the details.
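The following Python sketch is purely illustrative; the class and function names are invented and this is not how the Fabric Control VM is implemented. It only captures the criterion stated above and in the Layer 2 discussion: MAC routes go only to Node groups with an active port in the VLAN, and Layer 3 routes go only to Node groups with an active port in the routing instance.

# Conceptual model of the route-distribution rule; all names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class NodeGroup:
    name: str
    active_vlans: set = field(default_factory=set)        # VLANs with at least one up port
    active_instances: set = field(default_factory=set)    # routing instances with active ports

def targets_for_l2_route(vlan: int, node_groups: list) -> list:
    """A Layer 2 (MAC) route is sent only to Node groups active in that VLAN."""
    return [ng.name for ng in node_groups if vlan in ng.active_vlans]

def targets_for_l3_route(instance: str, node_groups: list) -> list:
    """A Layer 3 route is sent only to Node groups active in that routing instance."""
    return [ng.name for ng in node_groups if instance in ng.active_instances]

if __name__ == "__main__":
    groups = [
        NodeGroup("NW-NG-0", active_vlans={29, 709}, active_instances={"default"}),
        NodeGroup("RSNG0",   active_vlans={709},     active_instances={"default"}),
        NodeGroup("SNG-3",   active_vlans={650},     active_instances=set()),
    ]
    print(targets_for_l2_route(709, groups))         # ['NW-NG-0', 'RSNG0']
    print(targets_for_l3_route("default", groups))   # ['NW-NG-0', 'RSNG0']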

root@TEST-QFABRIC# run request component login NW-NG-0
Warning: Permanently added 'dcfnode-default---nw-ine-0,169.254.192.34' (RSA) to the list of known hosts.
Password:
--- JUNOS 13.1I20130618_0737_dc-builder built 2013-06-18 08:51:07 UTC
At least one package installed on this device has limited support.
Run 'file show /etc/notices/unsupported.txt' for details.
{master}
qfabric-admin@NW-NG-0> show route protocol direct
inet.0: 95 destinations, 95 routes (95 active, 0 holddown, 0 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both
172.17.106.128/30  *[Direct/0] 6d 08:38:41
                    > via ae4.0                <<<<< consider this route
172.17.106.132/30  *[Direct/0] 6d 08:38:38
                    > via ae5.0
172.17.106.254/32  *[Direct/0] 6d 09:10:36
                    > via lo0.0
{master}
qfabric-admin@NW-NG-0> show configuration interfaces ae4
description "NW-NG-0:ae4 P2P layer 3 to TSTRa:ae5";
metadata NW-NG-0:ae4;
mtu 9192;
mac f8:c0:01:f9:30:0c;
unit 0 {
    global-layer2-domainid 6;
    family inet {
        address 172.17.106.129/30;    <<<<< local IP address
    }
}
{master}
qfabric-admin@RSNG0> ... bgp.l3vpn.0 | find 172.17.106.132
65534:1:172.17.106.132/30
                   *[BGP/170] 6d 08:27:07, localpref 101, from 128.0.128.6
                      AS path: I, validation-state: unverified
                      to 128.0.128.4 via dcfabric.0, PFE Id 9 Port Id 17
                      to 128.0.128.4 via dcfabric.0, PFE Id 9 Port Id 18
                    > to 128.0.128.4 via dcfabric.0, PFE Id 9 Port Id 21

This information is similar to what was seen for the Layer 2 route. Since this particular route is a direct route on the NW-NG, the IP address 128.0.128.4 and the corresponding data plane information (PFE-ID 9 and Port-ID 21) should reside on the NW-NG. Here are the verification commands:

qfabric-admin@NW-NG-0> show fabric summary
Autonomous System : 100
INE Id : 128.0.128.4      <<<< this is correct
INE Type : Network
Simulation Mode : SI
{master}
qfabric-admin@NW-NG-0> show fabric multicast vccpdf-adjacency
Flags: S - Stale
Src     Src     Src       Dest                            Src   Dest
Dev id  INE     Dev type  Dev id  Interface         Flags Port  Port
9       34(s)   TOR       256     fte-2/1/1.32768         1     0
9       34(s)   TOR       512     fte-2/1/0.32768         0     0
10      259     TOR       256     n/a                     -1    -1
10      259     TOR       512     n/a                     -1    -1
11      34(s)   TOR       256     fte-1/1/1.32768         1     1
11      34(s)   TOR       512     fte-1/1/0.32768         0     1
12      259     TOR       256     n/a                     -1    -1
12      259     TOR       512     n/a                     -1    -1
256     260     F2        9       n/a                     -1    -1
256     260     F2        10      n/a                     -1    -1
256     260     F2        11      n/a                     -1    -1
256     260     F2        12      n/a                     -1    -1
512     261     F2        9       n/a                     -1    -1
512     261     F2        10      n/a                     -1    -1
512     261     F2        11      n/a                     -1    -1
512     261     F2        12      n/a                     -1    -1

{master}

So the PFE-ID of 9 indeed resides on the NW-NG. According to the output of show route fabric table bgp.l3vpn.0 taken from the RSNG, the port-ID of the remote Node is 21. This means that the corresponding port number on the NW-NG should be xe-2/0/4 (4 + 17 = 21). Note that the original Layer 3 route was a direct route because of the configuration on ae4 on the NW-NG. Hence one should expect xe-2/0/4 to be a part of ae4. Here is what the configuration looks like on the NW-NG:

qfabric-admin@NW-NG-0> show configuration interfaces xe-2/0/4
description "NW-NG-0:ae4 to TSTRa xe-4/2/2";
metadata MLNNG02a:xe-0/0/4;
ether-options {
    802.3ad ae4;    <<< this is exactly the expected information
}

{master}

BUM Traffic

A QFabric system can have 4095 VLANs configured on it and can be made up of multiple Node groups. A Node group may or may not have active ports in a specific VLAN. To maintain scale within a QFabric system, whenever data has to be flooded it is sent only to those Nodes that have an active port in the VLAN in question.

To make sure that flooding takes place according to these rules, the QFabric technology introduces the concept of a Multicast Core Key. A Multicast Core Key is a 7-bit value that identifies a group of Nodes for the purpose of replicating BUM traffic. This value is always generated by the active NW-NG-0 VM and is advertised to all the Nodes so that correct replication and forwarding of BUM traffic can take place.

As discussed, a Node should receive BUM traffic in a VLAN only if it has an active port (one in up/up status) in that VLAN. To achieve this, whenever a Node's interface becomes an active member of a VLAN, that Node relays this information to the NW-NG-0 VM over the CPE network. The NW-NG-0 VM processes this information from all the Nodes and generates a Multicast Core Key for that VLAN. This Multicast Core Key has an index of all the Nodes that subscribe to this VLAN (that is, the Nodes that have an active port in this VLAN). The Core Key is then advertised to all the Nodes and all the Interconnects by the NW-NG-0 VM over the CPE network. This process is hereafter referred to as "a Node subscribing to the VLAN."


Once the Nodes and Interconnects receive this information, they install a broadcast route in their default.bridge.0 table and the next hop for this route is the Multicast Core Key number. With this information, the Nodes and Interconnects are able to send the BUM data only to Nodes that subscribe to this VLAN.

Note that there is a specific table called default.fabric.0 that contains all the information regarding the Multicast Core Keys. This includes the information that the NW-NG-0 VM receives from the Nodes when they subscribe to a VLAN.
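Before walking through the CLI, here is a small Python sketch of the subscription logic just described. It is a conceptual model only; the class name, the key numbering, and the data structures are invented for illustration, and the real state lives inside the NW-NG-0 VM and the default.fabric.0 table.

# Conceptual model of VLAN subscription and Multicast Core Key generation.
# Every name here is hypothetical -- this is not QFabric code.

from itertools import count

class MulticastCoreKeyAllocator:
    """Tracks which Nodes subscribe to each VLAN and hands out one core key per VLAN."""

    def __init__(self):
        self._subscribers = {}         # l2domain_id -> set of subscribed PFE-IDs
        self._core_keys = {}           # l2domain_id -> allocated Multicast Core Key
        self._next_key = count(4101)   # arbitrary starting value for the sketch

    def subscribe(self, l2domain_id: int, pfe_id: int) -> int:
        """A Node reports an active port in the VLAN; return the VLAN's core key."""
        self._subscribers.setdefault(l2domain_id, set()).add(pfe_id)
        if l2domain_id not in self._core_keys:
            self._core_keys[l2domain_id] = next(self._next_key)
        # In the real system the updated PFE map and core key are then advertised
        # to all Nodes and Interconnects over the CPE network via the FC-VM.
        return self._core_keys[l2domain_id]

    def replication_targets(self, l2domain_id: int) -> set:
        """The set of PFE-IDs that should receive BUM traffic for this VLAN."""
        return set(self._subscribers.get(l2domain_id, set()))

if __name__ == "__main__":
    alloc = MulticastCoreKeyAllocator()
    alloc.subscribe(5, pfe_id=9)             # an NW-NG member already active in vlan.29 (L2 domain 5)
    key = alloc.subscribe(5, pfe_id=12)      # an RSNG0 member brings up a port in vlan.29
    print(key, alloc.replication_targets(5))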

Here is a step-by-step explanation of this process for vlan.29:

1. Vlan.29 is present only on the Nodes that are a part of the Network Node group:

root@TEST-QFABRIC> show vlans vlan.29
Name     Tag  Interfaces
vlan.29  29   NW-NG-0:ae0.0*, NW-NG-0:ae34.0*, NW-NG-0:ae36.0*, NW-NG-0:ae38.0*

2. The hardware token for vlan.29 is determined to be 5:

qfabric-admin@NW-NG-0> show vlans 29 extensive
VLAN: vlan.29---qfabric, Created at: Tue Dec 3 09:48:54 2013
802.1Q Tag: 29, Internal index: 7, Admin State: Enabled, Origin: Static
Protocol: Port Mode, Mac aging time: 300 seconds
Number of interfaces: Tagged 4 (Active = 4), Untagged 0 (Active = 0)
  ae0.0*, tagged, trunk
  ae34.0*, tagged, trunk
  ae36.0*, tagged, trunk
  ae38.0*, tagged, trunk
{master}
qfabric-admin@NW-NG-0> show fabric vlan-domain-map vlan 7
Vlan  L2Domain  L3-Ifl  L3-Domain
7     5         0       0

3. Since vlan.29 has active ports only on the Network Node Group, this VLAN shouldn’t exist on any other Node group:

root@TEST-QFABRIC> request component login RSNG0
Warning: Permanently added 'dcfnode-default-rsng0,169.254.193.3' (RSA) to the list of known hosts.
Password:
--- JUNOS 13.1I20130618_0737_dc-builder built 2013-06-18 08:50:01 UTC
At least one package installed on this device has limited support.
Run 'file show /etc/notices/unsupported.txt' for details.
{master}
qfabric-admin@RSNG0> show vlans 29
error: vlan with tag 29 does not exist
{master}
qfabric-admin@RSNG0>

4. At this point in time, the NW-NG-0's default.fabric.0 table contains only local information:

qfabric-admin@NW-NG-0> show fabric summary
Autonomous System : 100
INE Id : 128.0.128.4
INE Type : Network
Simulation Mode : SI
{master}
qfabric-admin@NW-NG-0> ...0 fabric-route-type mcast-routes l2domain-id 5
default.fabric.0: 88 destinations, 92 routes (88 active, 0 holddown, 0 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both
5.ff:ff:ff:ff:ff:ff:128.0.128.4:128:000006c3(L2D_PORT)/184
                   *[Fabric/40] 11w1d 02:59:50
                    > to 128.0.128.4:128(NE_PORT) via ae0.0, Layer 2 Fabric Label 1731
5.ff:ff:ff:ff:ff:ff:128.0.128.4:162:000006d3(L2D_PORT)/184
                   *[Fabric/40] 6w3d 09:03:04
                    > to 128.0.128.4:162(NE_PORT) via ae34.0, Layer 2 Fabric Label 1747
5.ff:ff:ff:ff:ff:ff:128.0.128.4:164:000006d1(L2D_PORT)/184
                   *[Fabric/40] 11w1d 02:59:50
                    > to 128.0.128.4:164(NE_PORT) via ae36.0, Layer 2 Fabric Label 1745
5.ff:ff:ff:ff:ff:ff:128.0.128.4:166:000006d5(L2D_PORT)/184
                   *[Fabric/40] 11w1d 02:59:50
                    > to 128.0.128.4:166(NE_PORT) via ae38.0, Layer 2 Fabric Label 1749
{master}

The command executed above is show route fabric table default.fabric.0 fabric-route-type mcast-routes l2domain-id 5.

5. The user now configures a port on the Node group named RSNG0 in vlan.29. After this, RSNG0 starts displaying the details for vlan.29:

root@TEST-QFABRIC# ...hernet-switching port-mode trunk vlan members 29
[edit]
root@TEST-QFABRIC# commit
commit complete
[edit]
root@TEST-QFABRIC# show | compare rollback 1
[edit interfaces]
+   P7122-C:xe-0/0/9 {
+       unit 0 {
+           family ethernet-switching {
+               port-mode trunk;
+               vlan {
+                   members 29;
+               }
+           }
+       }
+   }
[edit]
qfabric-admin@RSNG0> show vlans 29 extensive
VLAN: vlan.29---qfabric, Created at: Thu Feb 27 13:16:39 2014
802.1Q Tag: 29, Internal index: 7, Admin State: Enabled, Origin: Static
Protocol: Port Mode, Mac aging time: 300 seconds
Number of interfaces: Tagged 1 (Active = 1), Untagged 0 (Active = 0)
  xe-1/0/9.0*, tagged, trunk
{master}

6. Since RSNG0 is now subscribing to vlan.29, this information should be sent over to the NW-NG-0 VM. Here is what the default.fabric.0 table looks like on RSNG0:

qfabric-admin@RSNG0> show fabric summary
Autonomous System : 100
INE Id : 128.0.130.6
INE Type : Server
Simulation Mode : SI
{master}
qfabric-admin@RSNG0> ...c.0 fabric-route-type mcast-routes l2domain-id 5
default.fabric.0: 30 destinations, 54 routes (30 active, 0 holddown, 0 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both
5.ff:ff:ff:ff:ff:ff:128.0.130.6:49174:000006c1(L2D_PORT)/184
                   *[Fabric/40] 00:04:59
                    > to 128.0.130.6:49174(NE_PORT) via xe-1/0/9.0, Layer 2 Fabric Label 1729
{master}

This route is then sent over to NW-NG-0 via the Fabric Control VM.

7. NW-NG-0 receives the route from RSNG0 and updates its default.fabric.0 table:

qfabric-admin@NW-NG-0> ...ic-route-type mcast-routes l2domain-id 5
--snip--
5.ff:ff:ff:ff:ff:ff:128.0.130.6:49174:000006c1(L2D_PORT)/184
                   *[BGP/170] 00:07:34, localpref 100, from 128.0.128.6   <<<< learned via the FC-VM; the next hop 128.0.130.6 is RSNG0's INE ID
                      AS path: I, validation-state: unverified
                    > to 128.0.130.6 via dcfabric.0, Layer 2 Fabric Label 1729 PFE Id 12 Port Id 26
                    [BGP/170] 00:07:34, localpref 100, from 128.0.128.8
                      AS path: I, validation-state: unverified
                    > to 128.0.130.6 via dcfabric.0, Layer 2 Fabric Label 1729 PFE Id 12 Port Id 26

8. NW-NG-0 VM checks its database to find out the list of Nodes that already subscribe to vlan.29 and generates a PFE-map. This PFE-map contains the indices of all the Nodes that subscribe to vlan.29:

qfabric-admin@NW-NG-0> show fabric multicast root vlan-group-pfe-map
L2 domain  Group               Flag  PFE map  Mrouter PFE map
2          2.255.255.255.255   6     1A00/3   0/0
5          5.255.255.255.255   6     1A00/3   0/0
--snip--

Check the entry corresponding to the L2Domain-ID of the VLAN in question. In this case, the L2Domain-ID for vlan.29 is 5.

9. The NW-NG-0 VM creates a Multicast Core-Key for the PFE-map (4101 in this case):

qfabric-admin@NW-NG-0> ...multicast root layer2-group-membership-entries
Group Membership Entries:
--snip--
L2 domain: 5  Group:Source: 5.255.255.255.255
Multicast key: 4101
Packet Forwarding map: 1A00/3
--snip--

The command used here was show fabric multicast root layer2-group-membership-entries. This command is only available in Junos 13.1 and higher. In earlier versions of Junos, the show fabric multicast root map-to-core-key command can be used to obtain the Multicast Core Key number.

10. The NW-NG-0 VM sends this Multicast Core Key to all the Nodes and Interconnects via the Fabric Control VM. This information is placed in the default.fabric.0 table. Note that this table is only used to store the Core Key information and is not used to actually forward data traffic. Note that the next hop is 128.0.128.4, which is the IP address of the NW-NG-0 VM.

qfabric-admin@RSNG0> show route fabric table default.fabric.0 fabric-route-type mcast-member-map-key 4101
default.fabric.0: 30 destinations, 54 routes (30 active, 0 holddown, 0 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both
4101:7(L2MCAST_MBR_MAP)/184
                   *[BGP/170] 00:35:12, localpref 100, from 128.0.128.6
                      AS path: I, validation-state: unverified
                    > to 128.0.128.4 via dcfabric.0, PFE Id 27 Port Id 27
                    [BGP/170] 00:35:12, localpref 100, from 128.0.128.8
                      AS path: I, validation-state: unverified
                    > to 128.0.128.4 via dcfabric.0, PFE Id 27 Port Id 27


11. The NW-NG-0 VM sends out a broadcast route for the corresponding VLAN to all the Nodes and the Interconnects. The next hop for this route is set to the Multicast Core Key number. This route is placed in the default.bridge.0 table and is used to forward and flood the data traffic. The Nodes and Interconnects will install this route only if they have information for the Multicast Core Key in their default.fabric.0 table. In this example, note that the next hop contains the information for the Multicast Core Key as well:

qfabric-admin@RSNG0> show route fabric table default.bridge.0 l2domain-id 5
--snip--
5.ff:ff:ff:ff:ff:ff/88
                   *[BGP/170] 00:38:13, localpref 100, from 128.0.128.6
                      AS path: I, validation-state: unverified
                    > to 128.0.128.4:57005(NE_PORT) via dcfabric.0, MultiCast - Corekey:4101 Keylen:7
                    [BGP/170] 00:38:13, localpref 100, from 128.0.128.8
                      AS path: I, validation-state: unverified
                    > to 128.0.128.4:57005(NE_PORT) via dcfabric.0, MultiCast - Corekey:4101 Keylen:7

The eleven steps mentioned here are a deep dive into how the Nodes of a QFabric system subscribe to a given VLAN. The aim of this technology is to make sure that all the Nodes and the Interconnects have consistent information regarding which Nodes subscribe to a specific VLAN. This information is critical to ensuring that there is no excessive flooding within the data plane of a QFabric system.

At any point in time there may be multiple Nodes that subscribe to a VLAN, which raises the question of where a QFabric system should replicate BUM traffic. QFabric systems replicate BUM traffic at the following places (a conceptual sketch of this decision logic follows the list):

- Ingress Node: Replication takes place only if:

  - There are local ports in the VLAN on which the BUM traffic was received. The BUM traffic is replicated and sent out on the server-facing ports.

  - There are remote Nodes that subscribe to the VLAN in question. The BUM traffic is replicated and sent out towards these specific Nodes over the 40GbE FTE ports.

- Interconnects: Replication takes place if there are any directly connected Nodes that subscribe to the given VLAN.

- Egress Node: Replication takes place only if there are any local ports that are active in the given VLAN.
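The following Python sketch models these replication rules. It is only a conceptual illustration with invented names; in the real system the decision is made in hardware using the Multicast Core Key state described above.

# Conceptual model of where BUM traffic is replicated; not QFabric code.

def ingress_replication(node: dict, vlan: int, remote_subscribers: dict) -> list:
    """Return the copies an ingress Node makes for a BUM frame received in `vlan`."""
    copies = [("local-port", p) for p in node["active_ports"].get(vlan, [])]
    if remote_subscribers.get(vlan):
        # One copy towards the fabric (FTE links) if any remote Node subscribes.
        copies.append(("fte-uplink", sorted(remote_subscribers[vlan])))
    return copies

def interconnect_replication(core_key_members: set, attached_nodes: set) -> list:
    """An Interconnect replicates only towards directly attached, subscribed Nodes."""
    return sorted(core_key_members & attached_nodes)

if __name__ == "__main__":
    node1 = {"active_ports": {2: ["xe-0/0/10", "xe-0/0/11"]}}   # server-facing ports in VLAN 2
    remote = {2: {"Node-2", "Node-3"}}                          # remote subscribers to VLAN 2
    print(ingress_replication(node1, 2, remote))
    print(interconnect_replication({"Node-2", "Node-3"}, {"Node-2", "Node-4"}))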

Differences Between Control Plane Traffic and Internal Control Plane Traffic

Most of this chapter has discussed the various control plane characteristics of the QFabric system and how the routes are propagated from one RE to another. With this background, note the following functions that a QFabric system has to perform to operate:


- Control plane tasks: form adjacencies with other networking devices, learn Layer 2 and Layer 3 routes

- Data plane tasks: forward data end-to-end

- Internal control plane tasks: discover Nodes and Interconnects, maintain VCCPD and VCCPDf adjacencies, monitor the health of the VMs, and exchange routes within the QFabric system to enable communication between hosts connected on different Nodes

The third set of tasks is what makes QFabric a special system. All the control plane traffic that is used for the internal workings of the QFabric system is referred to as internal control plane traffic, and the last pages of this chapter are dedicated to bringing out the differences between control plane and internal control plane traffic. Consider the QFabric system shown in Figure 3.8.

Figure 3.8 Sample QFabric System and Connected Hosts

In Figure 3.8, the data plane is shown using blue lines and the CPE is shown using green lines. There are four Nodes, two Interconnects, and four Hosts. Host-1 and Host-2 are in the RED-VLAN (vlan.100), Host-3 and Host-4 are in YELLOW-VLAN (vlan.200). Node-1, Node-2, and Node-3 are SNGs, whereas Node-4 is an NW-NG Node and has Host-4, as well as a router (R1), directly connected to it. Finally, BGP is running between the QFabric and R1.


Let’s discuss the following traffic profiles: Internal Control Plane and Control Plane.

Internal Control Plane

- VCCPDf Hellos between the Nodes and the Interconnects are an example of internal control plane traffic.

- Similarly, the BGP sessions between the Node groups and the FC-VM are also an example of internal control plane traffic.

- Note that the internal control plane traffic is also generated by the CPU, but it's used for forming and maintaining the states of protocols that are critical to the inner workings of a QFabric system.

- Also, the internal control plane traffic is only ever used within the QFabric system. The internal control plane traffic will never be sent out of any Nodes.

Control Plane

- Start a ping from Host-1 to its default gateway. Note that the default gateway for Host-1 resides on the QFabric. In order for the ICMP pings to be successful, Host-1 will need to resolve ARP for the gateway's IP address. Note that Host-1 is connected to Node-1, which is an SNG, and the RE functionality is always locally active on an SNG. Hence the ARP replies will be generated locally by Node-1's CPU (SNG). The ARP replies are sent out to Host-1 using the data plane on Node-1.

- The BGP Hellos between the QFabric system and R1 will be generated by the active NW-NG-VM, which is running on the DGs. Even though R1 is directly connected to Node-4, the RE functionality on Node-4 is disabled because it is a part of the Network Node Group. The BGP Hellos are sent out to R1 using the data plane link between Node-4 and R1.

- The control plane traffic is always between the QFabric system and an external entity. This means that the control plane traffic eventually crosses the data plane, too, and goes out of the QFabric system via some Node(s).

To be an effective QFabric administrator, it is extremely important to know which RE/PFE would be active for a particular abstraction or Node group. Note that all the control plane traffic for a particular Node group is always originated by the active RE for that Node group. This control plane traffic is responsible for forming and maintaining peerings and neighborships with external devices. Here are some specific examples:

- A server connected to an SNG (the active RE is the Node's RE) via a single link: In this situation, if LLDP is running between the server and the QFabric, then the RE of the Node is responsible for discovering the server via LLDP. The LLDP PDUs will be generated by the RE of the Node, which helps the server discover the QFabric system.

- RSNG: For an RSNG, the REs of the Nodes have an active/passive relationship. For example:

- A server with two NICs connected to each Node of an RSNG: This is the classic use case for an RSNG, directly connecting servers to a QFabric system. Now if LACP is running between the server and the QFabric system,


then the active RE is responsible for exchanging LACP PDUs with the server to make sure that the aggregate link stays up.

- An access switch connected to each Node of an RSNG: This is a popular use case in which the RSNG (QFabric) acts as an aggregation point. You can eliminate STP by connecting the access switch to each Node of the RSNG and by running LACP on the aggregate port, which leads to a flat network design. Again, the active RE of the RSNG is responsible for exchanging LACP PDUs with the access switch to make sure that the aggregate link stays up.

- Running BGP between NW-NG-0 and an MX (external router): As expected, it's the responsibility of the active NW-NG-0 VM (located on the active DG) to make sure that the necessary BGP communication (keepalives, updates, etc.) takes place with the external router.


Test Your Knowledge

Q: What is the difference between a Node group and a Node device?

- A Node device can be any QFX3500 or QFX3600 Node that is a part of the QFabric system. A Node group is an abstraction and is a group of Node devices.

Q: What are the different kinds of Node groups and how many Node devices can be a part of each group?

- i) Server Node Group (SNG): Only one Node device can be a part of an SNG.

- ii) Redundant Server Node Group (RSNG): An RSNG consists of two Node devices.

- iii) Network Node Group (NNG): The NNG can consist of up to eight Node devices.

Q: Can a Node device be part of multiple Node groups at the same time?

- No.

Q: Where are the active/backup Routing Engines present for the various Node groups?

- i) SNG: The Routing Engine of the Node device is active. Since the Node group consists of only one Node device, there is no backup Routing Engine.

- ii) RSNG: The Routing Engine of one Node device is active and the Routing Engine of the other Node device is the backup.

- iii) NNG: The Routing Engines of the Node devices are disabled. The Routing Engine functionality is handled by two VMs running on the Director devices. These VMs operate in active/backup fashion.

Q: Are all the routes learned on a Node group's Routing Engine sent to all other Node devices for PFE programming?

- No. Routes are sent only to the Node devices that need them. This decision is different for Layer 2 and Layer 3 routes.

- i) Layer 2 routes: Routes are sent only to those Node devices that have an active port in that VLAN.

- ii) Layer 3 routes: Routes are sent only to those Node devices that have an active port in that routing instance.

Q: Which tables contain the Layer 2 and Layer 3 routes that get propagated internally between the components of a QFabric system?

- i) Layer 2 routes: bgp.bridgevpn.0

- ii) Layer 3 routes: bgp.l3vpn.0

- iii) Multicast routes: default.fabric.0


Chapter 4

Data Plane Forwarding

ARP Resolution for End-to-End Traffic

Layer 2 Traffic (Known Destination MAC Address with Source and Destination Connected on the Same Node)

Layer 2 Traffic (Known Destination MAC Address with Source and Destination Connected on Different Nodes)

Layer 2 Traffic (BUM Traffic)

Layer 3 Traffic (Destination Prefix is Learned on the Local Node)

Layer 3 Traffic (Destination Prefix is Learned on a Remote Node)

End-to-End Ping Between Two Hosts Connected to Different Nodes

External Device Forming a BGP Adjacency

Test Your Knowledge


This chapter concerns how a QFabric system forwards data. While the previous chapters in this book have explained how to verify the working of the control plane of a QFabric system, this chapter focuses only on the data plane.

The following phrases are used in this chapter:

- Ingress Node: the Node to which the source of a traffic stream is connected.

- Egress Node: the Node to which the destination of a traffic stream is connected.

NOTE The data plane on the Nodes and the Interconnects resides on the ASIC chip. Accessing and troubleshooting the ASIC is Juniper confidential and beyond the scope of this book.

This chapter covers the packet paths for the following kinds of traffic within a QFabric system:

- ARP resolution at the QFabric for end-to-end traffic

- Layer 2 traffic (known destination MAC address with source and destination connected on the same Node)

- Layer 2 traffic (known destination MAC address with source and destination connected on different Nodes)

- Layer 2 traffic (BUM traffic)

- Layer 3 traffic (destination prefix is learned on the local Node)

- Layer 3 traffic (destination prefix is learned on a remote Node)

- An example with an end-to-end ping between two hosts connected to different Nodes on the QFabric

ARP Resolution for End-to-End Traffic

Consider the following communication between Host-A and Host-B:

Host-A-------Juniper EX Series Switch----------Host-B

Host-A is in VLAN-1 and has an IP address of 1.1.1.100/24; Host-B is in VLAN-2 and has an IP address of 2.2.2.100/24. The RVIs for these VLANs are present on the Juniper EX Series switch, and these RVIs are also the default gateways for both VLANs.

According to the basic laws of networking, when Host-A sends some traffic to Host-B, the switch will need to resolve the destination’s ARP. This is fairly simple but things change when the intermediate device is a QFabric system.

As discussed in the previous chapters, a QFabric system scales by not sending all the routes to all the Nodes. As a result, when a Node doesn’t have any active ports in a given VLAN, that VLAN’s broadcast tree doesn’t exist locally. In simpler terms, if there is no active port for a VLAN on a Node, then that VLAN does not exist on the Node.

ARP resolution relies on broadcasting ARP requests in a VLAN and on the correct host sending a response. Since the QFabric architecture allows a VLAN not to exist on a Node (when there are no active ports in that VLAN on the Node), a Node can receive traffic that requires ARP resolution even though the destination VLAN (like VLAN-2 in the example just mentioned) does not exist on the ingress Node.

Let's look at two scenarios for the ingress Node (a conceptual sketch of this decision follows the list):

- The ingress Node has the destination VLAN configured on it.

- The ingress Node doesn't have any active ports in the destination VLAN, and hence the destination VLAN doesn't exist locally.
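The sketch below captures this decision in Python. It is conceptual only and every name in it is made up; it simply restates the two cases: if the destination VLAN exists on the ingress Node, the ARP request is generated and flooded locally along that VLAN's broadcast tree, otherwise the packet is handed to the Network Node Group for ARP processing.

# Conceptual decision only -- not QFabric code; all names are hypothetical.

class Node:
    def __init__(self, name: str, active_vlans):
        self.name = name
        self.active_vlans = set(active_vlans)

def where_to_resolve_arp(ingress_node: Node, dest_vlan: int) -> tuple:
    """Model of where the ARP request is generated for routed traffic."""
    if dest_vlan in ingress_node.active_vlans:
        # VLAN exists locally: the ingress Node generates the ARP request and floods
        # it on local ports plus the FTE links on the VLAN's broadcast tree.
        return ("generate-arp-locally", ingress_node.name)
    # VLAN does not exist locally: forward the data (fabric-encapsulated) to a
    # Network Node Group member; the active NW-NG-0 VM generates the ARP request
    # and sends one copy per subscribed Node over the CPE network.
    return ("punt-to-network-node-group", "NW-NG-0")

if __name__ == "__main__":
    node1 = Node("Node-1", active_vlans={1, 2})
    print(where_to_resolve_arp(node1, dest_vlan=2))   # generated locally on Node-1
    print(where_to_resolve_arp(node1, dest_vlan=3))   # punted to the NW-NG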

The Ingress Node Has the Destination VLAN Configured On It

Figure 4.1 depicts the Nodes of a QFabric system and the Hosts connected to those Nodes. Let's assume that all the connected Host ports are configured as access ports on the QFabric, in the corresponding VLANs shown in the figure. In this case, Host-A starts sending data to Host-B. Node-1 is the ingress Node and also has an active port in the destination VLAN. Following the conventions used throughout this book, the blue links on the Nodes are the 40GbE FTE links going to the Interconnects.

Figure 4.1 A QFabric Unit’s Nodes and Connected Hosts

Think of the QFabric as one large switch. When a regular switch needs to resolve ARP, it floods an ARP request in the corresponding VLAN, and the QFabric should behave in a similar way. In this particular example, the QFabric should send out the ARP request for Host-B on all the ports that have VLAN-2 enabled. There are three such ports: one locally on Node-1, and one each on Node-2 and Node-3.

Since Node-1 has VLAN-2 active locally, it would also have the information about VLAN-2’s broadcast tree. Whenever a Node needs to send BUM traffic on a VLAN that is active locally as well, that traffic is always sent out on the broadcast tree.


In this example, Node-1, Node-2, and Node-3 subscribe to the broadcast tree for VLAN-2. Hence, this request is sent out of the FTE ports towards Node-2 and Node-3. Once these broadcast frames (ARP requests) reach Node-2 and Node-3, they are flooded locally on all the ports that are active in VLAN-2.

Here is the sequence of steps:

1. Host-A wants to send some data to Host-B. Host-A is in VLAN-1 and Host-B is in VLAN-2. The IP and MAC addresses are shown in Figure 4.2.

2. Host-A sends this data to the default gateway (the QFabric’s RVI for VLAN-1).

3. The QFabric system needs to generate an ARP request for Host-B’s IP address.

4. Since the destination prefix (VLAN) is also active locally, the ARP request is generated locally on the ingress Node (Node-1).

Figure 4.2 Node-1 Sends the Request in VLAN-2

Node-1 sends out this ARP request on the local ports that are active for VLAN-2. In addition, Node-1 consults its VLAN broadcast tree and finds out that Node-2 and Node-3 also subscribe to VLAN-2’s broadcast tree. Node-1 sends out the ARP request over the FTE links. An extra header called the fabric header is added on all the traffic going on the FTE links to make sure that only Node-2 and Node-3 receive this ARP request.

The IC receives this ARP request from Node-1. The IC looks at the header appended by Node-1 and finds out that this traffic should be sent only to Node-2 and Node-3. The IC has no knowledge of the kind of traffic that is encapsulated within the fabric header.
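The exact format of the fabric header is proprietary and is not documented here. The sketch below is only a stand-in that shows the kind of information the surrounding paragraphs say it carries, either a unicast destination (PFE-ID and egress port) or a Multicast Core Key for BUM traffic, so that an Interconnect can forward on the outer header alone. The field names are invented.

# Hypothetical stand-in for the (proprietary) fabric header; for illustration only.

from dataclasses import dataclass
from typing import Optional

@dataclass
class FabricHeader:
    dest_pfe_id: Optional[int] = None          # set for known-unicast traffic
    dest_port_id: Optional[int] = None         # egress-port hint for the egress Node
    multicast_core_key: Optional[int] = None   # set instead for BUM traffic

def interconnect_forward(header: FabricHeader, core_key_members: dict) -> list:
    """An Interconnect looks only at the fabric header, never at the payload."""
    if header.multicast_core_key is not None:
        # Replicate towards every attached Node subscribed to the core key.
        return sorted(core_key_members.get(header.multicast_core_key, []))
    return [header.dest_pfe_id]

if __name__ == "__main__":
    members = {4101: {"Node-2", "Node-3"}}   # core key -> subscribed Nodes
    print(interconnect_forward(FabricHeader(dest_pfe_id=10, dest_port_id=25), members))
    print(interconnect_forward(FabricHeader(multicast_core_key=4101), members))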

In Figure 4.3, Node-2 and Node-3 receive one ARP request each from their FTE links. These Nodes flood the request on all ports that are active in VLAN-2.

Host-B replies to the ARP request.


Figure 4.3 Node-2 and Node-3 Receive Request from the Data Plane (Interconnect)

At this point in time, the QFabric system learns the ARP entry (ARP route) for Host-B (see Figure 4.4). Using the Fabric Control VM, this route is advertised to all the Nodes that have active ports in this VRF. Note that the ARP route is advertised to the relevant Nodes based on the same criteria as regular Layer 3 routes, that is, based on the VRF.

Figure 4.4 FC’s Role With Learning/Flooding the ARP Route

The sequence of events that takes place in Figure 4.4 is:


1. Node-1 now knows how to resolve the ARP for Host-B’s IP address. This is the only information that Node-1 needs to be able to send traffic to Host-B.

2. Host-A’s data is successfully sent over to Host-B via the QFabric system.

The Ingress Node Does Not Have the Destination VLAN Configured On It

Refer to Figures 4.5 – 4.7. In this case, Host-A starts sending data to Host-E. Node-1 is the ingress Node and it does not have any active port in the destination VLAN (VLAN-3).

This is a special case that requires additional steps to make end-to-end traffic work properly. That's because the ingress Node doesn't have the destination VLAN and hence doesn't subscribe to that VLAN's broadcast tree. Since Node-1 doesn't subscribe to the destination VLAN's broadcast tree, it has no way of knowing which Nodes should receive BUM traffic in that VLAN.

Note that the Network Node Group is the abstraction that holds most of the routing functionality of the QFabric. Hence, you’ll need to make use of the Network Node Group to resolve ARP in such a scenario.

Here is the sequence of steps that will take place:

1. Host-A wants to send some data to Host-E. Host-A is in VLAN-1 and Host-E is in VLAN-3. The IP and MAC addresses are shown in Figure 4.2.

2. Host-A sends this data to the default gateway (the QFabric’s RVI for VLAN-1).

3. The QFabric system needs to generate an ARP request for Host-E’s IP address.

4. Since the destination-prefix (VLAN) is not active locally, Node-1 has no way of knowing where to send the ARP request in VLAN-3. Because of this, Node-1 cannot generate the ARP request locally.

5. Node-1 is aware that Node-3 belongs to the Network Node Group. Since the NW-NG hosts the routing-functionality of a QFabric, Node-1 must send this data to the NW-NG for further processing.

Figure 4.5 Node-1 Encapsulates the Data with Fabric Header and Sends It to the Interconnects


6. Node-1 encapsulates the data received from Host-A with a fabric-header and sends it over to the NW-NG Nodes (Node-3 in this example).

Figure 4.6 Node-3 Sends the ARP Request to the NW-NG-0 VM for Processing

7. Node-3 receives the data from Node-1 and immediately knows that ARP must be resolved. Since resolving ARP is a Control plane function, this packet is sent over to the active NW-NG-VM. Since the VM resides on the active DG, this packet is now sent over the CPE links so that it reaches the active NW-NG-VM.

Figure 4.7 NW-NG-0 VM Generates the ARP Request and Sends It Towards the Nodes


8. The active NW-NG-VM does a lookup and knows that an ARP request needs to be generated for Host-E's IP address. The ARP request is generated locally. Note that this part of the process takes place on the NW-NG-VM that is located on the master DG. However, the ARP request must be sent out by the Nodes so that it can reach the correct host. For this to happen, the active NW-NG-VM sends one copy of the ARP request to each Node that is active for the destination VLAN. The ARP requests from the active NW-NG-VM are sent out on the CPE network.

9. In this specific example (the QFabric system depicted in Figure 4.2), there is only one Node that has VLAN-3 configured on it: Node-4. As a result, the NW-NG VM sends the ARP request only to Node-4. This Node receives the ARP request on its CPE links and floods it locally in VLAN-3. This is how the ARP request reaches the correct destination host.

10. Host-E replies to the ARP request.

11. At this point in time, the QFabric system learns the ARP entry (ARP route) for Host-E. Using the Fabric Control VM, this route is advertised to all the Nodes that have active ports in this VRF. This is the same process that was discussed for the previous scenario.

12. Since the QFabric now knows how to resolve the ARP for Host-E's IP address, Host-A's data is successfully sent to Host-E via the QFabric system.

Layer 2 Traffic (Known Destination MAC Address with Source and Destination Connected on the Same Node)

This is the simplest traffic forwarding case wherein the traffic is purely Layer 2 and both the source and destination are connected to the same Node.

In this scenario, the Node acts as a regular standalone switch as far as data plane forwarding is concerned. Note that the QFabric will need to learn MAC addresses in order to forward the Layer 2 traffic as unicast. Once the active RE for the ingress Node group learns the MAC address, it interacts with the Fabric Control VM and sends that MAC address to all the other Nodes that are active in that VLAN.

Layer 2 Traffic (Known Destination MAC Address with Source and Destination Connected on Different Nodes)

In this scenario, refer again to Figure 4.2, where Host-C wants to send some data to Host-B. Note that they are both in VLAN-2 and hence the communication between them is purely Layer 2 from the QFabric's perspective. Node-1 is the ingress Node and Node-2 is the egress Node. Since the MAC address of Host-B is already known to the QFabric, the traffic from Host-C to Host-B will be forwarded as unicast by the QFabric system.

Here is the sequence of steps that will take place:

1. Node-1 receives data from Host-C and looks up the Ethernet-header. The destination-MAC address is that of Host-B. This MAC address is already learned by the QFabric system.

2. At Node-1, this MAC address would be present in the default.bridge.0 table.

3. The next-hop for this MAC address would point to Node-2.


4. Node-1 adds the fabric-header on this data and sends the traffic out on its FTE link. The fabric header contains the PFE-id of Node-2.

5. The IC receives this information and does a lookup on the fabric-header. This reveals that the data should be sent towards Node-2. The IC then sends the data towards Node-2.

6. Node-2 receives this traffic on its FTE link. The fabric-header is removed and a lookup is done on the Ethernet-header.

7. The destination-MAC is learned locally and points to the interface connected to Host-B.

8. Traffic is sent out towards Host-B.

Layer 2 Traffic (BUM Traffic)

BUM (Broadcast, Unknown-unicast, and Multicast) traffic always requires flooding within the VLAN (note that multicast will not be flooded within the VLAN when IGMP snooping is turned on).

Given the unique architecture of the QFabric, forwarding BUM traffic requires special handling, because there can always be multiple Nodes that are active for a given VLAN. Whenever a VLAN is activated on a Node for the first time (either by bringing up an access port or by adding the VLAN to an already existing trunk port), it's the responsibility of that Node group's active RE to make sure that the Node subscribes to the broadcast tree for that VLAN. (Broadcast trees were discussed in detail in Chapter 3.)

With this in mind, let’s look at the data plane characteristics for forwarding BUM traffic. Since BUM traffic forwarding is based on the Multicast Core Key, whenever a Node gets BUM traffic in a VLAN, a lookup is done in the local default.bridge.0 table. If remote Nodes also subscribe to that core key, then the ingress Node will have a 0/32 route for that particular VLAN in which these next hop interfaces will be listed:

- All the local interfaces that are active in that VLAN.

- All the FTE links that point to the different Nodes that subscribe to that VLAN.

Depending upon the contents of this route, the BUM traffic is replicated out of all the next hop interfaces that are listed.

Before sending this traffic out on the FTE ports, the ingress Node adds the fabric header. The fabric header includes the Multicast Core Key information. Since the Interconnects also get programmed with the Multicast Core Key information, they do a lookup only on the fabric header and are able to determine the appropriate next hop interfaces on which this multicast traffic should be sent.

Note that replication of BUM traffic can happen at the Interconnects as well (in case an Interconnect needs to send BUM traffic towards two different Nodes).

Once the egress Nodes receive this traffic on their FTE links, they discard the fabric header and do local replication (if necessary) and flood this data out on all the local ports which are active in the corresponding VLAN.


QFabric technology uses a proprietary hash-based load balancing algorithm that ensures that no single Interconnect is overloaded with the responsibility of replicating and sending BUM traffic to all the egress Nodes. The internal details of the load balancing algorithm are beyond the scope of this book.

Layer 3 Traffic (Destination Prefix is Learned on the Local Node)

The Network Node Group abstraction is the routing brains for a QFabric system, since all the Layer 3 protocols run at the active Network Node Group-VM. To examine this traffic, let’s first look at Figure 4.8.

Figure 4.8 Network Node Group’s Connections

In Figure 4.8, the prefixes for Host-A and Host-B are learned by the QFabric’s Network Node Group-VM. Both these prefixes are learned from the routers that are physically located behind Node-1. Assuming that Host-A starts sending some traffic to Host-B, here is the sequence of steps that would take place:

1. Data reaches Node-1. Since this is a case for routing, the destination MAC address would be that of the QFabric. The destination IP address would be that of Host-B.

2. Node-1 does a local lookup and finds that the prefix is learned locally.

3. Node-1 decrements the TTL and sends the data towards R2 after making appropriate changes to the Ethernet header.

4. Note that since the prefix was learned locally, the data is never sent out on the FTE links.

As Figure 4.8 illustrates, the QFabric system acts as a regular router in this case, and it obeys all the basic laws of networking, such as resolving ARP for the next hop router's IP address. ARP resolution for a locally connected route was discussed as a separate case study earlier in this chapter.

Just like a regular router, the QFabric system makes sure that the TTL is also decremented for all IP-routed traffic before it leaves the egress Node.

Layer 3 Traffic (Destination Prefix is Learned on a Remote Node)

The scenario in which a QFabric system receives traffic to be Layer 3 routed builds on the previous scenario; however, here the destination prefix is located behind a remote Node (see Figure 4.7).

OSPF is running between R3 and the QFabric system and that’s how the prefix of Host-C is propagated to the QFabric system.

Again referring back to Figure 4.7, let's assume that Host-A starts sending traffic to Host-C. The sequence would be as follows (a conceptual sketch of the ingress processing follows these steps):

1. Data reaches Node-1. Since this is a case for routing, the destination MAC address would be that of the QFabric system, and the destination IP address would be that of Host-C.

2. Node-1 does a lookup and finds that the destination prefix points to a remote Node.

3. In order to send this data to the correct Node, Node-1 adds the fabric header and sends the data to one of the Interconnects. Note that the fabric header contains the destination Node’s PFE-ID.

4. Before adding the fabric header and sending the traffic towards the Interconnect, Node-1 decrements the TTL of the IP packets by one. Decrementing TTL at the ingress Node means the QFabric system doesn't have to add this overhead at the egress Node. Since the ingress Node has to do an IP lookup anyway, this is the most lookup-efficient and latency-efficient way to forward packets; the egress Node will not have to do an IP lookup at all. The fabric header also includes information about the port on the egress Node from which the traffic needs to be sent.

5. The Interconnect receives this traffic from Node-1. The Interconnects always look up the fabric header. In this case, the fabric header reveals that the traffic must be sent to the PFE-id of Node-2. The Interconnect sends this traffic out of the 40G port that points to Node-2.

6. Node-2 receives this traffic on its FTE link. A lookup is done on the fabric header. Node-2 finds the egress port for which traffic should be sent out. Before sending the traffic towards R2, Node-2 removes the fabric header.
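To summarize steps 1 through 6, here is a conceptual Python sketch of the ingress Node's processing for a prefix learned behind a remote Node. It is illustrative only; the helper name and the dictionary-based packet and header are invented, and the real lookup and rewrite happen in the PFE ASIC.

# Conceptual ingress pipeline for routed traffic towards a remote Node.
# All names are hypothetical; this is not QFabric software.

def ingress_route_to_remote_node(ip_packet: dict, route: dict) -> dict:
    """Decrement TTL at the ingress Node, then wrap the packet in a fabric header
    carrying the destination PFE-ID and egress port, so that the egress Node does
    not need to do another IP lookup."""
    assert ip_packet["ttl"] > 1, "expired packets are dropped, not forwarded"
    routed = dict(ip_packet, ttl=ip_packet["ttl"] - 1)   # TTL decremented once, at ingress
    return {
        "fabric_header": {
            "dest_pfe_id": route["pfe_id"],     # e.g., the PFE-ID of the egress Node
            "dest_port_id": route["port_id"],   # egress port on that Node
        },
        "payload": routed,
    }

if __name__ == "__main__":
    pkt = {"dst": "Host-C", "ttl": 64}
    rt = {"pfe_id": 12, "port_id": 25}   # values of the kind shown in the earlier route outputs
    frame = ingress_route_to_remote_node(pkt, rt)
    print(frame["fabric_header"], frame["payload"]["ttl"])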

End-to-End Ping Between Two Hosts Connected to Different Nodes

Let’s consider a real-world QFabric example like the one shown in Figure 4.9.


Figure 4.9 Real World Example of a QFabric Deployment

Figure 4.9 shows a QFabric system that is typical of one that might be deployed in a Data Center. This system has one Redundant-Server-Node group called RSNG-1. Node-1 and Node-2 are part of RSNG-1 and Node-1 is the master Node. Node-3 and Node-4 are member Nodes of the Network Node Group abstraction. Let’s assume DG0 is the master, and hence, the active Network Node Group-VM resides on DG0. The CPE and the Director devices are not shown in Figure 4.9.

Server-1 is dual-homed and is connected to both the Nodes of RSNG-1. The links coming from Server-1 are bundled together as a LAG on the QFabric system.

Node-3 and Node-4 are connected to a router R1. R1’s links are also bundled up as a LAG. There is OSPF running between the QFabric system and router R1 and R1 is advertising the subnet of Host-2 towards the QFabric system. Note that the routing functionality of the QFabric system resides on the active Network Node Group-VM. Hence the OSPF adjacency is really formed between the VM and R1.

Finally, let’s assume that this is a new QFabric system (no MAC addresses or IP routes have been learned). So, the sequence of steps would be as follows.

At RSNG-1

At RSNG-1 the sequence of steps for learning Server-1’s MAC address would be:

1. Node-1 is the master Node within RSNG-1. This means that the active RE resides on Node-1.

2. Server-1 sends some data towards the QFabric system. Since Server-1 is connected to both Node-1 and Node-2 using a LAG, this data can be received on either of the Nodes.

- 2a. If the data is received on Node-1, then the MAC address of Server-1 is learned locally in VLAN-1.

- 2b. If the data is received on Node-2, then it must first be sent over to the active RE so that MAC address learning can take place. Note that this first frame is sent to Node-1 over the CPE links. Once the MAC address of Server-1 is learned locally, this data is no longer sent to Node-1.

3. Once the MAC address is learned locally on Node-1, it must also send this Layer 2 route to all other Nodes that are active in VLAN-1. This is done using the Fabric Control-VM as discussed in Chapter 3.

Network Node Group

The sequence for learning Host-2’s prefix for the Network Node Group would be:

1. R1 is connected via a LAG to both Node-3 and Node-4. R1 is running OSPF and sends out an OSPF Hello towards the QFabric system.

2. The first step is to learn the MAC address of R1 in VLAN-2.

3. In this case, the traffic is incoming on a Node that is a part of the Network Node Group. This means that the REs on the Nodes are disabled and all the learning needs to take place at the Network Node Group VM.

4. This initial data is sent over the CPE links towards the master-DG (DG0). Once the DG receives the data, it is sent to the Network Node Group-VM.

5. The Network Node Group-VM learns the MAC address of R1 in VLAN-2 and distributes this route to all the other Nodes that are active in VLAN-2. This is again done using the Fabric Control-VM.

6. Note that the OSPF Hello was already sent to the active-Network Node Group VM. Since OSPF is enabled on the QFabric system as well, this Hello is processed.

7. Following the rules of OSPF, the necessary OSPF-packets (Hellos, DBD, etc.) are exchanged between the active-Network Node Group-VM and R1 and the adjacency is established and routes are exchanged between the QFabric and R1.

8. Note that whenever Node-3 or Node-4 receive any OSPF packets, they send the packets out of their CPE links towards the active DG so that this data can reach the Network Node Group-VM. This Control plane data is never sent out on the FTE links.

9. Once the Network Node Group-VM learns these OSPF routes from R1, it again leverages the internal-BGP peering with the Fabric Control-VM to distribute these routes to all the Nodes that are a part of this routing instance.

Ping on Server-1

After QFabric has learned the Layer 2 route for Server-1 and the Layer 3 route for Host-2, let’s assume that a user initiates a ping on Server-1. The destination of the ping is entered as the IP address of Host-2. Here is the sequence of steps that would take place in this situation:

1. A ping is initiated on Server-1. The destination for this ping is not in the same subnet as Server-1. As a result, Server-1 sends out this traffic to its default gateway (which is the RVI for VLAN-1 on the QFabric system).

2. This data reaches the QFabric system. Let’s say this data comes in on Node-2.

3. At this point, the destination MAC address is the QFabric system's own MAC address. Node-2 does a lookup and finds that the destination IP address is that of Host-2. A routing lookup on this prefix reveals a next hop of R1.

4. In order to send this data to R1, the QFabric system also needs to resolve the ARP for R1's IP address. The connected interface that points to R1 is the RVI for VLAN-2. Also, VLAN-2 doesn't exist locally on Node-2.

Technically, the QFabric would have resolved the ARP for R1 while forming the OSPF adjacency. That fact was omitted here to illustrate the complete sequence of steps for end-to-end data transfer within a QFabric system.

5. This is the classic use case in which the QFabric must resolve an ARP for a VLAN that doesn’t exist on the ingress Node.

6. As a result, Node-2 encapsulates this data with a fabric header and sends it out of its FTE links towards the Network Node Group Nodes (Node-3 in this example.)

7. The fabric header would have Node-3’s PFE-id. The Interconnects would do a lookup on the fabric header and send this data over to Node-3 so that it could be sent further along to the Network Node Group VM for ARP resolution.

8. Node-3 sends this data to the master DG over the CPE links. The master DG in turn sends it to the active Network Node Group VM.

9. Once the active Network Node Group VM receives this, it knows that ARP must be resolved for R1’s IP address in VLAN-2. The Network Node Group VM generates an ARP request packet and sends it to all the Nodes that are active in VLAN-2. (Note that this communication takes place over the CPE network.)

10. Each Node that is active in VLAN-2 receives this ARP-request packet on its CPE links. This ARP request is then replicated by the Nodes and flooded on all the revenue ports that are active in VLAN-2.

11. This is true for Node-3 and Node-4 as well. Since the link to R1 is a LAG, only one of these Nodes sends out the ARP request towards R1.

12. R1 sends out an ARP reply and it is received on either Node-3 or Node-4.

13. Since ARP learning is a control plane function, this ARP reply is sent towards the master DG so that it can reach the active Network Node Group VM.

14. The VM learns the ARP for R1 and then sends out this information to all the Nodes that are active in the corresponding routing instance.

15. At this point in time, Node-1 and Node-2 know how to reach R1.

16. Going back to Step 4, Node-2 now knows how to route traffic to R1, and the local tables on Node-2 point to Node-3 as the next hop to reach R1.

17. Since the data to be sent between Server-1 and Host-2 has to be routed at Layer 3, Node-2 decrements the IP TTL and adds the fabric header to the traffic. The fabric header contains the PFE-id of Node-3.

18. After adding the fabric header, this data is sent out on the FTE links towards one of the Interconnects.

19. The Interconnects do a lookup on the fabric header and determine that all this traffic should be sent to Node-3.

20. Node-3 receives this traffic on its FTE links and sends this data out towards R1 after modifying the Ethernet header.
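Once the ARP resolution described above has completed, the entry for R1 can be checked from the CLI. Again, this is only a sketch: R1's VLAN-2 address (10.1.2.1) and the MAC address shown are placeholder values carried over from the earlier illustration.

qfabric-admin@qfabric> show arp no-resolve | match 10.1.2.1
00:11:22:33:44:55  10.1.2.1         vlan.2              none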

External Device Forming a BGP Adjacency

Refer back to Figure 4.9 if need be, but let’s consider the scenario when there is a BGP adjacency between the QFabric and R1 instead of OSPF. Here is how the BGP Hellos would flow:

1. A BGP peer is configured on the QFabric system.

2. BGP control traffic is unicast (it runs over a TCP session), unlike OSPF's multicast Hellos. The QFabric system knows that the peer (R1) is reachable through Node-4, and Node-4's Data plane (PFE) is programmed accordingly to reach R1.

3. Node-4 is a part of the Network Node Group. Hence the RE functionality is present only on the active NW-NG-VM, which resides on the DG.

4. The active NW-NG-VM generates the corresponding BGP packets (keepalives, updates, etc.) and sends them to Node-4 via the CPE network. Note that the DGs are not plugged into the Data plane at all. The only way a packet that originates on the DGs (VMs) makes it to the Data plane is through the CPE.

5. These BGP packets reach Node-4. Node-4's PFE already has the information needed to reach R1, so the packets are forwarded in the Data plane to R1.

6. R1 replies back with BGP packets.

7. Node-4 looks at the destination MAC address and knows that this should be processed locally. This packet is sent to the active NW-NG-VM via the CPE.

8. The packet reaches the active NW-NG-VM.

9. This is how bidirectional communication takes place within a QFabric system.

NOTE This is a rather high-level sequence of the events that take place for BGP peering between the QFabric and R1; it does not take into account everything that needs to happen before peering, such as learning R1's MAC address, resolving R1's ARP, and so on.
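The state of the resulting peering can be checked with the usual command. The output below is illustrative only; the peer address (10.1.2.1), the AS number (65001), and the packet and route counts are assumptions rather than output from a real QFabric system.

qfabric-admin@qfabric> show bgp summary
Groups: 1 Peers: 1 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0                 3          3          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
10.1.2.1              65001        120        118       0       0       55:33 3/3/3/0              0/0/0/0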

Test Your Knowledge

Q: Consider an MX router that is a BGP peer of a QFabric system. What path do the BGP Hellos take?

 A QFabric system can only run BGP through the Network Node Group. Also, the Routing Engine for the Network Node Group is present on the VMs that run on the Director devices. Hence, in this case, the BGP Hellos will enter the QFabric system on a Node device configured to be part of the Network Node Group. From there, the BGP Hellos are sent over to the NW-NG-0 VM via the Control plane Ethernet segment.

Q: If a Node device receives a broadcast frame on one of its ports, on which ports would it be flooded?

 The Node device will flood the frame on all the ports that are a part of that VLAN's broadcast tree. This includes all the ports on this Node that are active in the VLAN, and also one or more of the 40GbE FTE links in case some other Nodes also have ports active in this VLAN.
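To see which ports are active in a given VLAN, and therefore which ports would take part in this flooding, the VLAN membership can be listed. This is a sketch with placeholder VLAN and interface names rather than output from the topology used in this chapter.

qfabric-admin@qfabric> show vlans VLAN-1
Name           Tag     Interfaces
VLAN-1         100
                       RSNG-1:ae0.0*, NW-NG-0:ae1.0*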

Q: What extra information is added to the data that is sent out on the 40GbE FTE links?

 Every Node device that is a part of a QFabric system adds a fabric header to data before sending it out of the FTE links. The fabric header contains the PFE-ID of the remote Node device where the data should be sent.

Q: How can the PFE-ID of a Node be obtained?

 Use the CLI command show fabric multicast vccpdf-adjacency, and then correlate its output with the output of the show virtual-chassis CLI command. Consider the following snippets taken from an RSNG:

qfabric-admin@RSNG0> show fabric multicast vccpdf-adjacency
Flags: S - Stale
Src     Src      Src      Dest                              Src   Dest
Dev id  INE      Dev type Dev id   Interface         Flags  Port  Port
9       34       TOR      256      n/a                      -1    -1
9       34       TOR      512      n/a                      -1    -1
10      259(s)   TOR      256      fte-0/1/1.32768          1     3
10      259(s)   TOR      512      fte-0/1/0.32768          0     3
11      34       TOR      256      n/a                      -1    -1
11      34       TOR      512      n/a                      -1    -1
12      259(s)   TOR      256      fte-1/1/1.32768          1     2
12      259(s)   TOR      512      fte-1/1/0.32768          0     2

The Src Dev id column shows the PFE-IDs of all the Nodes, while the Interface column shows the interfaces that are connected to the Interconnects, but only for those Node devices that are a part of this Node group (RSNG0 in this case). (Note that the traditional Junos interface format is used, namely FPC/PIC/PORT.) From this output you can see that the Node device with PFE-ID 10 corresponds to FPC 0 and the Node device with PFE-ID 12 corresponds to FPC 1.

The next step is to correlate this output with the show virtual-chassis command:

qfabric-admin@RSNG0> show virtual-chassis
Preprovisioned Virtual Chassis
Virtual Chassis ID: 0000.0103.0000
                                                Mstr
Member ID  Status   Model           prio  Role      Serial No
0 (FPC 0)  Prsnt    qfx3500          128  Master*   P6810-C
1 (FPC 1)  Prsnt    qfx3500          128  Backup    P7122-C

{master}

You can see that the Node device that corresponds to FPC 0 has the serial number P6810-C and the one that corresponds to FPC 1 has the serial number P7122-C. The aliases of these Nodes can then be checked either by looking at the configuration or by issuing the show fabric administration inventory command.
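For completeness, this is roughly what the inventory check looks like. Treat it as a sketch: the exact columns of show fabric administration inventory can vary by release, and the aliases shown here (node0, node1) are placeholders.

qfabric-admin@qfabric> show fabric administration inventory
Item                     Identifier     Connection  Configuration
Node group
  RSNG0                                 Connected   Configured
    node0                P6810-C        Connected
    node1                P7122-C        Connected
(output abbreviated)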


Books for Cloud Building and Hi-IQ Networks

The following books can be downloaded as free PDFs from www.juniper.net/dayone:

 This Week: Hardening Junos Devices

 This Week: Junos Automation Reference for SLAX 1.0

 This Week: Mastering Junos Automation

 This Week: Applying Junos Automation

 This Week: A Packet Walkthrough on the M, MX, and T Series

 This Week: Deploying BGP Multicast VPNs, Second Edition

 This Week: Deploying MPLS