SDN AND OPENFLOW
THE HYPE AND THE HARSH REALITY
Ivan Pepelnjak, CCIE#1354 Emeritus
The information is provided on an "as is" basis. The authors and ipSpace.net shall have neither
liability nor responsibility to any person or entity with respect to any loss or damages arising from
the information contained in this book.
I remember the first time I heard him on a podcast, I thought to myself "This guy must be super
smart, because he sounds like a Bond villain and I can only grasp 50% of what he's saying." I
started telling colleagues about him, "Hey, check this guy out. His webinars will make your brain
bleed out of your ears!" Trust me, in my circle that's a HUGE compliment.
When I was chosen to attend my first Tech Field Day event, I was most excited because I would
finally get to meet Ivan in person. All my engineering friends were jealous and I was almost
apoplectic when the moment finally arrived, fearful I would do something foolish like confuse SMTP
and SNMP. This is when I discovered a really wonderful aspect of Ivan: if you're ever lucky enough
to interact with him personally (stalking doesn't count), you'll find him to be witty, friendly,
generous and gracious. He never makes you feel stupid for not understanding a protocol, the details
of an RFC or an IEEE standard.
He's the consummate educator and a giving mentor to almost anyone who asks. The more I know
him, the more I admire and respect his dedication to engineering. It truly is a vocation for him.
Michele Chubirka
Security architect, analyst, writer and podcaster
December 2013
More than three years later, the media still doesn't understand the basics of SDN, and many
networking engineers feel threatened by what they see as a fundamental shift in the way they do
their jobs.
In the meantime, I published over a hundred blog posts on ipSpace.net trying to debunk the myths,
explain how SDN and OpenFlow work, and what their advantages and limitations are. Most of the
posts were responses to external triggers: false claims, vendor launches, or questions I received
from my readers.
This book contains a collection of the most relevant blog posts describing the concepts of SDN and
OpenFlow. I cleaned up the blog posts and corrected obvious errors and omissions, but also tried to
leave most of the content intact. The commentaries between the individual blog posts will help you
understand the timeline or the context in which a particular blog post was written.
The debunking of the initial hype surrounding the OpenFlow public launch and the most blatant
misconceptions (Chapter 1);
An overview of what SDN is, what its benefits might be, and deliberations on whether or not it makes
sense (Chapter 2);
You'll find additional SDN- and OpenFlow-related information on the ipSpace.net web site:
As always, please do feel free to send me any questions you might have; the best way to reach me
is to use the contact form on my web site (www.ipSpace.net).
Happy reading!
Ivan Pepelnjak
July 2014
Since then, every single vendor started offering SDN products. Almost none of them come even
close to the (narrow) vision promoted by the Open Networking Foundation (centralized control plane
with distributed data plane), NEC's ProgrammableFlow being a notable exception.
Most vendors decided to SDN-wash their existing products, branding their existing APIs as open and
claiming they have SDN-enabled products.
MORE INFORMATION
You'll find additional SDN- and OpenFlow-related information on the ipSpace.net web site:
As usual, the industry media didn't help: they enthusiastically jumped onto the OpenFlow/SDN
bandwagon and started propagating myths. More than two years later they still don't understand the
fundamentals of SDN, and tend to focus exclusively on how SDN is supposed to hurt Cisco (or not).
IN THIS CHAPTER:
OPEN NETWORKING FOUNDATION FABRIC CRAZINESS REACHES NEW HEIGHTS
OPENFLOW FAQ: WILL THE HYPE EVER STOP?
OPENFLOW IS LIKE IPV6
FOR THE RECORD: I AM NOT AGAINST OPENFLOW
NETWORK FIELD DAY FIRST IMPRESSIONS
I APOLOGIZE, BUT I'M EXCITED
THE REALITY TWO YEARS LATER
CONTROL AND DATA PLANE SEPARATION THREE YEARS LATER
TWO AND A HALF YEARS AFTER OPENFLOW DEBUT, THE MEDIA REMAINS CLUELESS
WHERE'S THE REVOLUTIONARY NETWORKING INNOVATION?
FALLACIES OF GUI
Networking vendors, either trying to protect their margins by stalling the progress of this initiative,
or stampeding into another Wild West Gold Rush (hoping to unseat their bigger competitors with
low-cost standards-based alternatives), have joined the foundation in hordes; the list of initial
members reads like a Who's Who in Networking.
Now, let's try to figure out what SDN might be all about. The ONF Mission Statement (on the first
page) says SDN "allows owners and operators of networks to control and manage their networks to
best serve their needs." Are the founding members of ONF trying to tell us they have no control over
their networks and lack network management systems? It must be something else. How about this
one (from the same paragraph): "OpenFlow seeks to increase network functionality while lowering ..."
(Some of) the industry media happily joined the craze, parroting meaningless phrases from various
press releases. Consider, for example, this article from IT World Canada.
"SDN would give network operators the ability to virtualize network resources, being able to
dynamically improve latency or security on demand." If you want to do that, you can do it today, using
dynamic routing protocols or QoS (latency), vShield/VSG (on-demand security) or a number of
virtualized networking appliances.
Also, protocols like RSVP to signal per-session bandwidth needs have been around for more than a
decade, but somehow never caught on. Must be the fault of those stupid networking vendors.
"Sites like Facebook, Google or Yahoo would be able to tailor their networks so searches would be
blindingly fast." I never realized the main search problem was network bandwidth. I always somehow
thought it was related to large datasets, CPU, database indices ... Anyhow, if network bandwidth
is the bottleneck, why don't they upgrade to next-generation Ethernet (10G/40G)? Ah, yes, it
might be expensive. How about deploying a Clos network architecture? Ouch, might be a nightmare to
configure and manage. How exactly will SDN solve this problem?
"Stock exchanges could assure brokerage customers on the other side of the globe they'd get
financial data as fast as a dealer beside the exchange." Will SDN manage to flatten and shrink the
earth, will it change the speed of light, or will it use large-scale quantum entanglement?
"It could be programmed to order certain routers to be powered down during off-peak power
periods." What stops you from doing that today?
However, there are plenty of open standards in the networking industry (including XML-based
network configuration and management) waiting to be used. There are also (existing, standard)
technologies that you can use to solve most of the problems these people are complaining about.
The problem is that these standards and technologies are not used by operating systems or
applications (when was the last time you deployed a server running OSPF to have seamless
multihoming?).
The main problems we're facing today arise primarily from non-scalable application architectures
and a broken TCP/IP stack. In a world with scale-out applications you don't need fancy combinations
of routing, bridging and whatever else; you just need fast L3 transport between endpoints. In an
Internet with a decent session layer or a multipath transport layer (be it SCTP, Multipath TCP or
something else) you don't need load balancers, BGP sessions with end-customers to support
multihoming, or LISP. All these kludges were invented to support OS/App people firmly believing in
the fallacies of distributed computing. How is SDN supposed to change that? I'm anxiously waiting to
see an answer beyond marketing/positioning/negotiating bullshit bingo.
NW: "OpenFlow is a programmable network protocol designed to manage and direct traffic among
routers and switches from various vendors." This one is just a tad misleading. OpenFlow is actually a
protocol that allows a controller to download forwarding tables into one or more switches. Whether
that manages or directs traffic depends on what the controller is programmed to do.
NW: "The technology consists of three parts: [...] and a proprietary OpenFlow protocol for the
controller to talk securely with switches." Please do decide what you think proprietary means. All
parts of the OpenFlow technology are defined in publicly available documents under a BSD-like
license.
NW: "MPLS is a Layer 3 technique while OpenFlow is a Layer 2 method." Do I need to elaborate on
this gem? Let's just point out that OpenFlow works with MAC addresses, IP subnets, IP flow 5-
tuples, VLANs or MPLS labels. Whatever a switch can do, OpenFlow can control it.
But wait ... OpenFlow has no provision for IPv6 at all. Maybe Network World is so futuristic they
consider a technology without IPv6 support a layer-2 technology.
To understand his statement, remember that OpenFlow is nothing more than a standardized version
of the communication protocol between the control and data planes. It does not define a radically new
architecture, it does not solve distributed or virtualized networking challenges, and it does not create
new APIs that applications could use. The only thing it provides is the exchange of TCAM (flow)
data between a controller and one or more switches.
Cold fusion-like claims are nothing new in the IT industry. More than a decade ago another group of
people tried to persuade us that changing the network layer address length from 32 bits to 128 bits
and writing it in hex instead of decimal solves global routing and multihoming and improves QoS,
security and mobility. After the reality distortion field collapsed, we were left with the same set of
problems exacerbated by the purist approach of the original IPv6 architects.
Did we have a similar functionality in the past? If not, why not? Was there no need or were the
vendors too lazy to implement it (don't forget they usually follow the money)?
Did it work? If not, why not?
If it did - do we really need a new technology to replace a working solution?
Did it get used? If not, why not? What were the roadblocks? Why would OpenFlow remove them?
Repeat this exercise regularly and you'll probably discover the new emperor's clothes aren't nearly
as shiny as some people would make you believe.
On the more technological front, I still don't expect to see miracles. Most OpenFlow-related ideas
I've heard about have been tried (and failed) before. I fail to see why things would be different just
because we use a different protocol to program the forwarding tables.
The vendor and user presentations we've seen at that symposium, combined with the vendor
presentations we've attended during Networking Tech Field Day 2, seemed very promising:
everyone was talking about the right topics and tried to address real-life scalability concerns.
Explosion of innovation, and it's not just OpenFlow and/or SDN. Last year we saw some great
products and a few good ideas (earning me the "grumpy old man that's hard to make smile" fame);
this year almost every vendor had something that excited me.
If you were watching the video stream, you probably got sick and tired of my "wow, that's cool"
comments. I apologize, but that's how I felt.
Everyone gets the problem ... and some of the vendors were trying to tell us what the problem is in
a CIO-level pitch. Not a good idea. However, it's refreshing to see that everyone identified the
same problem (large-scale data centers, VM mobility ...), that it's the problem we're all familiar
with, and that it's actually getting solved.
Layer-2 is fading away (again). While every switching vendor will tell you how you can build large L2
domains with their fabric, nobody is actually pushing them anymore. And the only time layer-2 Data
Center Interconnect (DCI) appeared on a slide, there was a unicorn image next to it. What's more,
two vendors actually said they think long-distance VM mobility is not a good idea (you'll have to
watch the videos to figure out who they were).
We're cutting through the hype. Even the OpenFlow symposium was hypeless. It's so nice being able
to spend three days with highly intelligent people who are excited about the next great thing
(whatever it is), while being perfectly realistic about its current state and its limitations.
You'll see lots of new things in the future. Even if you're working in an SMB environment, you might
get exposed to OpenFlow in the not-too-distant future (more about that in an upcoming post).
Get ready for a bumpy ride. Lots of exciting technologies are being developed. Some of them make
perfect sense, some others less so. Some of them might work, some might fade away (not because
they would be inherently bad, but because of bad execution). Now is the time to jump on those
bandwagons: get involved (hint: you just might start with IPv6), build a test lab, kick the tires, and
figure out whether the new technologies might be a good fit for your environment when they
become stable.
Disclosure: vendors mentioned in this post indirectly covered my travel expenses. Read the full
disclosure (or a more precise one by Tony Bourke).
Watching the presentations from the OpenFlow symposium is a great starting point. I would start
with the ones from Igor Gashinsky (Yahoo!) and Ed Crabbe (Google); they succinctly explained the
problems they're facing in their networks and how they feel OpenFlow could solve them. If you're an
IaaS cloud provider, this is the time to start thinking about the potential OpenFlow could bring to your
network, and if you're not talking to NEC, BigSwitch or Nicira, you're missing out. I would also talk
with Juniper (more about that later).
Next step: watch the vendor presentations from the OpenFlow symposium. Kyle Forster presented a
high-level overview of Big Switch architecture, Curt Beckmann from Brocade added a healthy dose
of reality check (highly appreciated), David Meyer (Cisco) presented an interesting perspective on
robustness and complexity (and several OpenFlow use cases), Don Clark from NEC talked about
The afternoon technical Q&A panel just confirmed that numerous vendors understand the
challenges associated with OpenFlow deployments outside of small lab setups quite well, and that they're
actively working on solving those problems and making OpenFlow a viable technology.
Two vendors expanded their coverage of OpenFlow during the Network Field Day: David Ward from
Juniper did a technical deep dive (don't skip the Junos automation part at the beginning of the
video, it's interesting ... and you just might spot the VRF Smurf) and NEC even showed us a demo
of their OpenFlow-based switched network.
Luckily there are still some cool-headed people around (read Ethan Banks' OpenFlow State of the
Union and Derick Winkworth's More OpenFlow Symposium Notes), but I can't help myself. The
grumpy old man from the L3 ivory tower is excited (listen to the PacketPushers OpenFlow/SDN podcast if
you don't believe me), and not just about OpenFlow. I still can't believe that I stumbled upon so
many interesting or cool technologies or solutions in the last few days. Could be that it's just
vendors adapting to the blogging audience, or there might actually be something fundamentally new
coming to light, like MPLS (then known as tag switching) was in the late 1990s.
Disclosure: vendors mentioned in this post indirectly covered my travel expenses. Read the full
disclosure (or a more precise one by Tony Bourke).
Every major vendor is talking about SDN, but it's mostly SDN-washing (aka CLI-in-API-disguise).
Cisco is talking about OnePK, and has a shipping early-adopter SDK, but it will take a while before
we see OnePK in GA code on a widespread platform.
Startups aren't doing any better. Big Switch is treading water and trying to find a useful use case for
their controller. Nicira was acquired by VMware and is moving away from OpenFlow. Contrail was
acquired by Juniper and recently shipped its product (which has nothing to do with OpenFlow and
not much with SDN). LineRate Systems was acquired by F5 and disappeared.
We haven't seen customer deployments either. Facebook is doing interesting things (but from what
I've heard they're not OpenFlow-based), Google has an OpenFlow/SDN deployment, but they could
have done the exact same thing with classical routers and PCEP, and Microsoft's SDN is based on BGP
(and works fine).
It seems like reality hit OpenFlow, and it was a very hard hit; according to Gartner, we
haven't reached the trough of disillusionment yet.
Since I wrote this blog post, Facebook launched their own switch operating system, which seems to
be working along the same lines as classical network operating systems (one device, one control
plane).
Google implemented their inter-DC WAN network with switches that use OpenFlow within a
switching fabric and BGP/IS-IS and something akin to PCEP between sites;
Facebook is working on the networking platform for their Open Compute Project. It seems
they've got the switch hardware specs; I haven't heard about software running on those switches
yet, or maybe they'll go down the same path as Google ("We got cheap switches, and we have
our own software. Goodbye and thank you!").
In the networking vendor world, NEC seems to be the only company with a mature commercial
product that matches the ONF definition of SDN. Cisco has just shipped the initial version of their
controller, as did HP, and those products seem pretty limited at the moment.
Wondering why I didn't include Big Switch Networks in the above list? My definition of shipping
includes publicly available product documentation, or (at the very minimum) something resembling
a data sheet with feature descriptions, system requirements and maximum limits. I couldn't find
either on the Big Switch web site.
On the other hand, the virtual networking world was always full of solutions with separate control
and data planes, starting with the venerable VMware Distributed vSwitch and Nexus 1000V, and
continuing with newer entrants, from Hyper-V extensible switch and VMware NSX to Juniper Contrail
and IBM's 5000V and DOVE. Some of these solutions were used years before the explosion of
OpenFlow/SDN hype (only we didn't know we should call them SDN).
[SDN] takes the high-end features built into routers and switches and puts them into
software that can run on cheaper hardware. Corporations still need to buy routers and
switches, but they can buy fewer of them and cheaper ones.
SDN cannot move hardware features into software. If a device relies on hardware forwarding,
you cannot move the same feature into software without significantly impacting the forwarding
performance.
SDN software runs on cheaper hardware. Ignoring the intricacies of custom ASICs and
merchant silicon (and the fact that Cisco produces more custom ASICs than all merchant silicon
vendors combined), complexity and economies of scale dictate the hardware costs. It's pretty hard
to make cheaper hardware with the same performance and feature set.
However, all networking vendors bundle the software with the hardware devices and expense R&D
costs (instead of including them in COGS) to boost their perceived margins.
Corporations can buy fewer routers and switches. It can't get any better than this. If you need
100 10GE ports, you need 100 10GE ports. If you need two devices for two WAN uplinks (for
redundancy), you need two devices. SDN won't change the port count, redundancy requirements, or
laws of physics.
Corporations can buy cheaper [routers and switches]. Guess what: you still need the
software to run them, and until we see the price tags of SDN controllers and do a TCO calculation,
claims like this one remain wishful thinking (you did notice I'm extremely diplomatic today, didn't
you?).
Much as I agree with him, we can't change much on planet Earth due to the fact that VMs use
Ethernet NICs (so we need some form of VLANs to cater to the infinite creativity of some people), IP
addresses (so we need L3 forwarding), a broken TCP stack (requiring load balancers to fix it), and
obviously can't be relied upon to be sufficiently protected (so we need external firewalls).
Furthermore, unless we manage to stop shifting the problems around, networking as a whole
won't get simpler.
What overlay network virtualization does bring us is a decoupling that makes the physical infrastructure
less complex, so it can focus on packet forwarding instead of zillions of customer-specific features
preferably baked into custom ASICs. Obviously that's not a good thing for everyone out there.
FALLACIES OF GUI
I love Greg Ferro's characterization of CLI:
We need to realise that the CLI is a power tool for specialist tradespeople and not a
knife and fork for everyday use.
However, you do know that most devices' GUIs offer nothing more than what the CLI does, don't you?
Where's the catch?
For whatever reason, people find colorful screens full of clickable items less intimidating than a
blinking cursor on a black background. Makes sense: after all, you can see all the options you have;
you can try pulling down things to explore possible values, and commit the changes once you think
you've enabled the right set of options. Does that make a product easier to use? Probably. Will it result
in a better-performing product? Hardly.
Have you ever tried to configure OSPF through a GUI? How about trying to configure usernames and
passwords for individual wireless users? In both cases you're left with the same options you'd have
in the CLI (because most vendors implement the GUI as eye candy in front of the CLI or API). If you know
how to configure OSPF or a RADIUS server, the GUI helps you break the language barrier (example:
moving from Cisco IOS to Junos); if you don't know what OSPF is, the GUI still won't save the day ... or
it might, if you try clicking all the possible options until you find one that seems to work (expect a
few meltdowns along the way if you're practicing your clicking skills on a live network).
That definition definitely suits one of the ONF founding members (Google), but is it relevant to the
networking community at large? Or does it make more sense to focus on network programmability,
or using existing protocols (BGP) in novel ways?
This chapter contains my introductory posts on the SDN-related topics, musings on what makes
sense, and a few thoughts on career changes we might experience in the upcoming years. You'll find
more details in subsequent chapters, including an overview of OpenFlow, in-depth analysis of
OpenFlow-based architectures, some real-life OpenFlow and SDN deployments, and alternate
approaches to SDN.
MORE INFORMATION
You'll find additional SDN- and OpenFlow-related information on the ipSpace.net web site:
IN THIS CHAPTER:
WHAT EXACTLY IS SDN (AND DOES IT MAKE SENSE)?
BENEFITS OF SDN
DOES CENTRALIZED CONTROL PLANE MAKE SENSE?
HOW DID SOFTWARE DEFINED NETWORKING START?
WE HAD SDN IN 1993 AND DIDN'T KNOW IT
STILL WAITING FOR THE STUPID NETWORK
IS CLI IN MY WAY OR IS IT JUST A SYMPTOM OF A BIGGER PROBLEM?
OPENFLOW AND SDN: DO YOU WANT TO BUILD YOUR OWN RACING CAR?
SDN, WINDOWS AND FRUITY ALTERNATIVES
SDN, CAREER CHOICES AND MAGIC GRAPHS
RESPONSE: SDN'S CASUALTIES
[SDN is] The physical separation of the network control plane from the forwarding plane,
and where a control plane controls several devices.
Does this definition make sense or is it too limiting? Is there more to SDN? Would a broader scope
make more sense?
A BIT OF HISTORY
It's worth looking at the founding members of ONF and their interests: most of them are large cloud
providers looking for the cheapest possible hardware, preferably using a standard API so it can be
sourced from multiple suppliers, driving the prices even lower. Most of them are big enough to write
their own control plane software (and Google already did).
A separation of the control plane (running their own software) and the data plane (implemented in low-
cost white-label switches) was exactly what they wanted to see, and the Stanford team working on
Will physical separation of the control and forwarding planes solve any of these? It might, but there are
numerous tools out there that can do the same without overhauling everything we've been doing for
the last 30 years.
NOW WHAT?
Does it make sense to accept a definition of SDN that makes sense to ONF founding members but
not to your environment? Shall we strive for a different definition of SDN, or just move on, declare it
as meaningless as cloud, and focus on solving our problems? Would it be better to talk about
NetOps?
Maybe we should stop talking and start doing: there are plenty of things you can do within existing
networks using existing protocols.
BENEFITS OF SDN
Paul Stewart wrote a fantastic blog post in May 2014 listing the potential business benefits of SDN
(as promoted by SDN evangelists and SDN-washing vendors).
I have just one problem with this list: I've seen a similar list of benefits of IPv6:
Unfortunately, the reality of IT in general and IPv6 in particular is a bit different. The overly hyped
IPv6 benefits remain myths and legends; all we got were longer addresses, incompatible protocols
(OSPFv3, anyone?), and half-thought-out implementations (example: DNS autoconfiguration) riddled
with religious wars (try asking why we don't have a first-hop router option in DHCPv6 on any IPv6 mailing
list ;).
For more information, watch the fantastically cynical presentation Enno Rey gave at the Troopers 2014
IPv6 Security Summit, or my IPv6 resources.
You've stated a couple of times that you don't favor the OpenFlow version of SDN due to
a variety of problems like scaling and latency. What model/mechanism do you like?
Hybrid? Something else?
Before answering the question, let's step back and ask another one: does a centralized control plane,
as evangelized by the ONF, make sense?
A BIT OF HISTORY
As always, let's start with one of the greatest teachers: history. We've had centralized architectures
for decades, from SNA to various WAN technologies (SDH/SONET, Frame Relay and ATM). They all
share a common problem: when the network partitions, the nodes cut off from the central
intelligence stop functioning (in the SNA case) or remain in a frozen state (WAN technologies).
One might be tempted to conclude that the ONF version of SDN won't fare any better than the
switched WAN technologies. Reality is far worse:
Interestingly, MPLS-TP wants to reinvent the glorious past and re-introduce centralized path
management, yet again proving RFC 1925 section 2.11.
The last architecture (that I remember) that used a truly centralized control plane was SNA, and if
you're old enough you know how well that ended.
Interestingly, numerous data center architectures already use centralized control planes, so we can
analyze how well they perform:
NEC ProgrammableFlow seems to be an outlier: they can control up to 200 switches, for a total of
over 9000 GE (not 10GE) ports, but they don't run any control-plane protocols (apart from ARP and
dynamic MAC learning) with the outside world. No STP, LACP, LLDP, BFD or routing protocols.
One could argue that we could get an order of magnitude beyond those numbers if only we were
using proper control plane hardware (Xeon CPUs, for example). I don't buy that argument till I
actually see a production deployment, and do keep in mind that the NEC ProgrammableFlow Controller
uses decent Intel-based hardware. Real-time distributed systems with fast feedback loops are way
more complex than most people looking from the outside realize (see also RFC 1925, section 2.4).
Finally, do keep in mind that the whole world of IT is moving toward scale-out architectures. Netflix
& Co are already there, and the enterprise world is grudgingly taking its first steps. In the
meantime, OpenFlow evangelists talk about the immeasurable revolutionary merits of a centralized
scale-up architecture. They must be living on a different planet.
I finally found the answer in a fantastic overview of the technologies and ideas that led to OpenFlow and
SDN, published in the December 2013 issue of acmqueue. According to that article, SDN first appeared in
an article published by MIT Technology Review that explains how Nick McKeown and his team at
Stanford use OpenFlow:
Frustrated by this inability to fiddle with Internet routing in the real world, Stanford
computer scientist Nick McKeown and colleagues developed a standard called OpenFlow
that essentially opens up the Internet to researchers, allowing them to define data flows
using software--a sort of "software-defined networking."
You did notice the "a sort of" classification and the quotes around SDN, didn't you? It's pretty obvious
how the article uses software-defined networking to illustrate the point, but once marketing took
over, all hope for reasonable discussion was lost, and SDN became even more meaningless than cloud.
In 1993 we were (among other things) an Internet Service Provider offering dial-up and leased-line
Internet access. Being somewhat lazy, we hated typing in the same commands every time we had
to provision a new user (in pre-TACACS+ days we had to use local authentication to have the
autocommand capability for dial-up users) and developed a solution that automatically changed
the router configurations after we added a new user. Here's a high-level diagram of what we did:
An HTML user interface (written in Perl) gave the operators easy access to the user database (probably
implemented as a text file; we were true believers in the NoSQL movement in those days), and a back-
end Perl script generated router configuration commands from the user definitions and downloaded
them (probably through rcp; the details are a bit sketchy) to the dial-up access servers.
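Just to show how little magic was involved, here's a minimal present-day sketch of the same idea in Python (the original was a Perl script; the file format, usernames and router commands below are made up for illustration):

#!/usr/bin/env python3
# Hypothetical re-creation of the 1993 provisioning hack: read a flat-file
# user database and generate router configuration commands for dial-up users.
# The file format and the commands are illustrative, not the original ones.
from pathlib import Path

TEMPLATE = """username {user} password {password}
username {user} autocommand ppp negotiate"""

def generate_config(userdb: str) -> str:
    commands = []
    for line in Path(userdb).read_text().splitlines():
        if not line.strip() or line.startswith("#"):
            continue                        # skip blank lines and comments
        user, password = line.split()[:2]   # one "user password" pair per line
        commands.append(TEMPLATE.format(user=user, password=password))
    return "\n".join(commands)

if __name__ == "__main__":
    print(generate_config("users.txt"))
    # The original back-end script then copied the generated commands to the
    # access servers (probably with rcp); today you'd push them with NETCONF,
    # Ansible or a similar tool.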
The next revision of the software included support for leased-line users: the script generated interface
configurations and static routes for our core router (it was actually an MGS, but I found no good
MGS images on the Internet) or one of the access servers (for users using asynchronous modems).
How is that different from all the shiny new stuff vendors are excitedly talking about? Beats me, I
can't figure it out ;) ... and as I said before, you don't always need new protocols to solve old
problems.
Here are a few juicy quotes from that article (taken completely out of context solely for your
enjoyment).
The telcos seemed to "fall asleep at the switch" at the core of their network.
The Intelligent Network impedes innovation. Existing features are integrally spaghetti-
coded into the guts of the network, and new features must intertwine with the old.
The whole article is well worth reading, more so considering it's over 15 years old and still spot-on.
We're all sick of CLI. I don't think anyone would disagree. However, the CLI is not our biggest
problem. We happen to be exposed to the CLI on a daily basis due to a lack of automation tools and
a lack of an abstraction layer; occasional fights with the usual brown substance flowing down the
application stack don't help either.
The CLI problem is mostly hype. The "we need to replace CLI with (insert-your-favorite-gizmo)"
hype was generated by SDN startups (one in particular) that want to sell their disruptive way of
doing things to the venture capitalists. BTW, the best way to configure their tools is through CLI.
CLI is still the most effective way of doing things: ask any really proficient sysadmin, web
server admin or database admin how they manage their environment. It's not through a point-and-
click GUI; it's through automation tools coupled with simple CLI commands (because automation
tools don't work that well when they have to simulate mouse clicks).
The true difference between other IT fields and networking is that the other people did something to
solve their problems while we keep complaining. Networking is no worse than any other IT
discipline; we just have to start moving forward, create community tools, and vote with our wallets.
Whenever you have a choice between two comparable products from different vendors, buy the one
that offers greater flexibility and programmability. Don't know what to look for? Talk with your
server and virtualization buddies (I hope you're on speaking terms with them, or it's high time you
bought them a beer or two). If they happen to use Puppet or Chef to manage servers, you might try to
use the same tools to manage your routers and switches. Your favorite boxes don't support the tools
used by the rest of your IT? Maybe it's time to change the vendor.
Imagine you want to build your own F1 racing car... but the only component you got is a super-
duper racing engine from Mercedes Benz. You're left with the "easy" task of designing the car body,
suspension, gears, wheels, brakes and a few other choice bits and pieces. You can definitely do all
that if you're Google or the McLaren team, but not if you're a Sunday hobbyist mechanic. No wonder
some open-source OpenFlow controllers look like Red Bull Flugtag contestants.
Does that mean we should ignore OpenFlow? Absolutely not, but unless you want to become really
fluent in real-time event-driven programming (which might look great on your resume), you should
join me watching from the sidelines until there's a solid controller (maybe we'll get it with Daylight,
Floodlight definitely doesn't fit the bill) and some application architecture blueprints.
Of course he's right, and while, as Bob Plankers explains, you can never escape some lock-in (part 1,
response from Greg Ferro, part 2; all definitely worth reading), you do have to ask yourself: am I
looking for Windows or Mac?
There are all sorts of arguments one hears from Mac fanboys (here's a networking-related one), but
regardless of what you think of Mac and OSX, there's the indisputable truth: compared to the reloadful
experience we get on most Windows-based boxes, Macs and OSX are rock solid; I have to reboot
my MacBook every other blue moon. Even Windows is stable when running on a MacBook (apart
from upgrade-induced reboots).
Before you start praising Steve Jobs and blaming Bill Gates and Microsoft at large, consider a simple
fact: OSX runs on a tightly controlled hardware platform built with stability and reliability in mind.
Windows has to run on every possible underperforming concoction a hardware vendor throws at you
(example: my high-end laptop cannot record system audio because the 6-letter hardware vendor
wanted to save $0.02 on the sound chipset and chose the cheapest possible one), and has to deal
with all sorts of crap third-party device drivers loaded straight into the operating system kernel.
If you're young and brazen (like I was two decades ago), go ahead and be your own system
integrator. If you're too old and covered with vendor-inflicted scars, you might prefer a tested end-
to-end solution, regardless of what Gartner says in vendor-sponsored reports (and even solutions
that vendor X claims were tested don't always work). Just don't forget to consider the cost of
downtime in your total-cost-of-ownership calculations.
I have 8 plus years in Cisco, have recently passed my CCIE RS theory, and was looking
forward to complete the lab test when this SDN thing hit me hard. Do you suggest
completing the CCIE lab looking at this new future of Networking?
Short answer: the sky is not falling, CCIE still makes sense, and IT will still need networking people.
However, as I recently collected a few magic graphs for a short keynote speech, let me reuse them
to illustrate this particular challenge we're all facing. Starting with the obvious, here's the legendary
Diffusion of Innovations: every idea is first adopted by a few early adopters, followed by the early and
late majority.
Networking in general is clearly in the late majority/laggards phase. What's important for our
discussion is the destruction of value-add through the diffusion process. Oh my, I sound like a
freshly-baked MBA whiz-kid; let's reword it: as a technology gets adopted, more people understand
it, job market competition increases, and thus it's harder to get a well-paying job in that
particular technology area. Supporting Windows desktops might be a good example.
Initially every new idea is a great unknown, with only a few people brave enough to invest time in it
(CCIE R&S before Cisco made it mandatory for Silver/Gold partner status). After a while, the
successful ideas explode into stars with huge opportunities and fat margins (example: CCIE R&S a
decade ago, Nicira-style SDN today, at least for Nicira's founders), and then degenerate into a cash cow as
Does it make sense to invest in something that's probably in the cash cow stage? The theory says
"as much as needed to keep it alive," but don't forget that CCIE R&S will likely remain very relevant for
a long time:
The protocol stacks we're using haven't changed in the last three decades (apart from extending
the address field from 32 to 128 bits), and although people are working on proposals like MP-
TCP, those proposals are still in the experimental stage;
Regardless of all the SDN hoopla, neither OpenFlow nor other SDN technologies address the real
problems we're facing today: the lack of a session layer in TCP and the use of IP addresses in the
application layer. They just give you different tools to implement today's kludges.
Cisco is doing constant refreshes of its CCIE programs to keep them in the early adopters or
early majority technology space, so the CCIE certification is not getting commoditized.
If you approach the networking certifications the right way, you'll learn a lot about the principles
and fundamentals, and you'll need that knowledge regardless of the daily hype.
Now that I've mentioned experimental technologies: don't forget that not all of them get adopted
(even by early adopters). Geoffrey Moore made millions writing a book that pointed out that obvious
fact. Of course he was smart enough to invent a great-looking wrapper; he called it Crossing the
Chasm.
The crossing-the-chasm dilemma is best illustrated with Gartner Hype Cycles. After all the initial
hype (that we've seen with OpenFlow and SDN) resulting in the peak of inflated expectations, there's
the ubiquitous trough of disillusionment. Some technologies die in that quagmire; in other, more
successful cases we eventually figure out how to use them (slope of enlightenment).
We still don't know how well SDN will do crossing the chasm (according to the latest Gartner
charts, OpenFlow still hasn't reached the hype peak; I dread what's still lying ahead of us); we've
seen only a few commercial products, and none of them has anything close to widespread adoption
(not to mention the reality of three IT geographies).
Finally, don't ask me for "what will the next big thing be" advice. Browse through the six years of
my blog posts. You might notice a clear shift in focus; it's there for a reason.
The resulting flurry of expected blog posts included an interesting one from Steven Iveson in which
he made a good point: it's easy for the cream of the crop not to be concerned, but what about the
others lower down the pile? As always, it makes sense to do a bit of a reality check.
While everyone talks about SDN, the products are scarce, and it will take years before they'll
appear in a typical enterprise network. Apart from NEC's ProgrammableFlow and overlay
networks, most other SDN-washed things I've seen are still point products.
Overlay virtual networks seem to be the killer app of the moment. They are extremely useful and
versatile ... if you're not bound to VLANs by physical appliances. We'll have to wait for at least
another refresh cycle before we get rid of them.
Data center networking is hot and sexy, but it's only a part of what networking is. I haven't seen
a commercial SDN app for enterprise WAN, campus or wireless (I'm positive I'm wrong; write a
comment to correct me), because that's not where the VCs are looking at the moment.
Also, consider that the "my job will be lost to technology" sentiments started approximately 200 years
ago, and yet the population has increased by almost an order of magnitude in the meantime, there
Obviously you should be worried if you're a VLAN provisioning technician. However, with everyone
writing about SDN, you know what's coming down the pipe, and you have a few years to adapt,
expand the scope of your knowledge, and figure out where it makes sense to move (and don't forget
to focus on where you can add value, not what job openings you see today). If you don't do any of
the above, don't blame SDN when the VLANs (finally) join the dinosaurs and you have nothing left to
configure.
Finally, I'm positive there will be places using VLANs 20 years from now. After all, AS/400s and
APPN are still kicking, and people are still fixing COBOL apps (which IBM just made sexier with XML
and Java support).
Did you ever encounter a Catalyst 5000 with a Route Switch Module (RSM), or a combination of a Catalyst
5000 and an external router, using Multilayer Switching (MLS)? Those products used an architecture
identical to OpenFlow's almost 20 years ago, the only difference being the relative openness of the
OpenFlow protocol.
MORE INFORMATION
You'll find additional SDN- and OpenFlow-related information on the ipSpace.net web site:
What is OpenFlow?
What can different versions of OpenFlow do?
How can a controller implement control-plane protocols (like LACP, STP or routing protocols)
and does it have to?
Can we deploy OpenFlow in combination with traditional forwarding mechanisms?
IN THIS CHAPTER:
MANAGEMENT, CONTROL AND DATA PLANES IN NETWORK DEVICES AND SYSTEMS
WHAT EXACTLY IS THE CONTROL PLANE?
WHAT IS OPENFLOW?
WHAT IS OPENFLOW (PART 2)?
OPENFLOW PACKET MATCHING CAPABILITIES
OPENFLOW ACTIONS
OPENFLOW DEPLOYMENT MODELS
FORWARDING MODELS IN OPENFLOW NETWORKS
YOU DON'T NEED OPENFLOW TO SOLVE EVERY AGE-OLD PROBLEM
OPENFLOW AND IPSILON: NOTHING NEW UNDER THE SUN
Process the transit traffic (that's why we buy them) in the data plane;
Figure out what's going on around it with the control plane protocols;
Interact with its owner (or a Network Management System, NMS) through the management
plane.
Routers are used as a typical example in every text describing the three planes of operation, so let's
stick to this time-honored tradition:
Interfaces, IP subnets and routing protocols are configured through management plane
protocols, ranging from CLI to NETCONF and the latest buzzword, northbound RESTful APIs (see the sketch below);
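As a quick illustration of management-plane access, here's a minimal sketch that pulls the running configuration over NETCONF using the Python ncclient library (the device address and credentials are placeholders, and the device obviously needs NETCONF-over-SSH enabled):

# Minimal sketch: fetch the running configuration of a device over NETCONF.
from ncclient import manager

with manager.connect(
    host="192.0.2.1",        # placeholder management address
    port=830,                # default NETCONF-over-SSH port
    username="admin",        # placeholder credentials
    password="admin",
    hostkey_verify=False,    # acceptable in a lab sketch, not in production
) as m:
    reply = m.get_config(source="running")
    print(reply.data_xml)    # XML representation of the running configuration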
The management plane is pretty straightforward, so let's focus on a few intricacies of the control
and data planes.
We usually have routing protocols in mind when talking about control plane protocols, but in reality
the control plane protocols perform numerous other functions, including:
The data plane should be focused on forwarding packets, but is commonly burdened by other activities:
Data plane forwarding is hopefully performed in dedicated hardware or in high-speed code (within
the interrupt handler on low-end Cisco IOS routers), while the overhead activities usually happen on
the device CPU (sometimes even in userspace processes; the switch from high-speed forwarding to
user-mode processing is commonly called punting).
In reactive OpenFlow architectures a punting decision sends a packet all the way to the
OpenFlow controller.
Other control plane protocols (BGP, OSPF, LDP, LACP, BFD ...) are more clear-cut: they run
between individual network devices (usually adjacent, but there's also targeted LDP and multihop
BGP) and could be (at least in theory) made to run across a separate control plane network (or
VRF).
Control plane protocols usually run over data plane interfaces to ensure shared fate (if the
packet forwarding fails, the control plane protocol fails as well), but there are scenarios
(example: optical gear) where the data plane interfaces cannot process packets, forcing you
to run control plane protocols across a separate set of interfaces.
Typical control plane protocols aren't data-driven: a BGP, LACP or BFD packet is never sent as a direct
response to a data plane packet.
ICMP is different: some ICMP packets are sent as replies to other ICMP packets, others are triggered
by data plane packets (ICMP unreachables and ICMPv6 neighbor discovery).
Vendor terminology doesn't help us either: most vendors talk about Control Plane Policing or
Protection. These mechanisms usually apply to control plane protocols as well as data plane packets
punted from ASICs to the device CPU.
Even IETF terminology isn't exactly helpful: while the C in ICMP does stand for Control, it doesn't
necessarily imply control plane involvement. ICMP is simply a protocol that passes control messages
(as opposed to user data) between IP devices.
Honestly, I'm stuck. Is ICMP a control plane protocol that's triggered by data plane activity or is it a
data plane protocol? Can you point me to an authoritative source explaining what ICMP is? Share
your thoughts in the comments!
WHAT IS OPENFLOW?
A typical networking device (bridge, router, switch, LSR ...) runs all the control protocols (including
port aggregation, STP, TRILL, MAC address learning and routing protocols) in the control plane
(usually implemented in central CPU or supervisor module), and downloads the forwarding
instructions into the data plane structures, which can be simple lookup tables or specialized
hardware (hash tables or TCAMs).
In architectures with distributed forwarding hardware, the control plane has to use a communications
protocol to download the forwarding information into data plane instances. Every vendor uses its
own proprietary protocol (Cisco uses IPC, InterProcess Communication, to implement distributed
CEF); OpenFlow tries to define a standard protocol between the control plane and associated data plane
elements.
The OpenFlow zealots would like you to believe that we're just one small step away from
implementing Skynet; the reality is a bit more sobering. You need a protocol between control and
data plane elements in all distributed architectures, starting with modular high-end routers and
switches. Almost every modular high-end switch that you can buy today has one or more supervisor
modules and numerous linecards performing distributed switching (preferably over a crossbar
matrix, not over a shared bus). In such a switch, an OpenFlow-like protocol runs between the supervisor
module(s) and the linecards.
You might have noticed that all vendors support a limited number of high-end switches in a central
control plane architecture (Cisco's VSS cluster has two nodes and HP's IRF cluster can have up to
four high-end switches). This decision has nothing to do with vendor lock-in and lack of open
protocols but rather reflects the practical challenges of implementing a high-speed distributed
architecture (alternatively, you might decide to believe the whole networking industry is a
confusopoly of morons who are unable to implement what every post-graduate student can simulate
with open source tools).
Moving deeper into the technical details, the OpenFlow Specs page on the OpenFlow web site
contains a link to the OpenFlow Switch Specification v1.1.0, which defines:
The designers of OpenFlow had to make the TCAM structure very generic if they wanted to offer an
alternative to numerous forwarding mechanisms implemented today. Each entry in the flow tables
contains the following fields: ingress port, source and destination MAC address, ethertype, VLAN tag
& priority bits, MPLS label & traffic class (starting with OpenFlow 1.1), IP source and destination
address (and masks), layer-4 IP protocol, IP ToS bits and TCP/UDP port numbers.
You can pass metadata between tables to make the architecture even more versatile.
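To make the list of fields a bit more tangible, here's a minimal sketch of a flow match built from a few of them using the Ryu OpenFlow controller framework (Ryu is just one possible controller, and the values are purely illustrative):

# Minimal sketch: an OpenFlow 1.3 match combining L1, L2, L3 and L4 fields
# from the list above, expressed with the Ryu parser classes.
from ryu.ofproto import ofproto_v1_3_parser as parser

match = parser.OFPMatch(
    in_port=1,                                # ingress port
    eth_type=0x0800,                          # IPv4
    vlan_vid=(0x1000 | 100),                  # VLAN 100 (presence bit set)
    ipv4_dst=("192.0.2.0", "255.255.255.0"),  # destination prefix and mask
    ip_proto=6,                               # TCP
    tcp_dst=80)                               # destination port 80

# Printing the object shows the OXM fields that would go on the wire
print(match)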
The proposed flow table architecture is extremely versatile (and I'm positive there's a PhD thesis
being written proving that it is a superset of every known and imaginable forwarding paradigm), but
it will have to meet the harsh reality before we'll see full-blown OpenFlow switch products. You can
implement the flow tables in software (in which case the versatility never hurts, but you'll have to
wait a few years before the Moore's Law curve catches up with terabit speeds) or in hardware, where
the large TCAM entries will drive the price up.
I don't think OpenFlow is clearly defined yet. Is it a protocol? A model for control plane to
forwarding plane (CP-FP) interaction? An abstraction of the forwarding plane? An automation
technology? Is it a virtualization technology? I don't think there is consensus on these things
yet.
OpenFlow is very well defined. It's a control plane (controller) to data plane (switch) protocol that
allows the control plane to:
As part of the protocol, OpenFlow defines abstract data plane structures (forwarding table entries)
that have to be implemented by OpenFlow-compliant forwarding devices (switches).
Is it an abstraction of the forwarding plane? Yes, as far as it defines data structures that can be used
in OpenFlow messages to update data plane forwarding structures.
Alternatively, you could use OpenFlow to create additional forwarding (actually packet dropping)
entries in access switches or wireless access points deployed throughout your network, resulting in a
scalable multi-vendor ACL solution.
Is it a virtualization technology? Of course not. However, its data structures can be used to perform
MAC address, IP address or MPLS label lookup and push user packets into VLANs (or push additional
VLAN tags to implement Q-in-Q) or MPLS-labeled frames, so you can implement most commonly
used virtualization techniques (VLANs, Q-in-Q VLANs, L2 MPLS-based VPNs or L3 MPLS-based VPNs)
with it.
There's no reason you couldn't control a soft switch (embedded in the hypervisor) with OpenFlow. An
open-source hypervisor switch implementation (Open vSwitch) that has many extensions for
virtualization is already available and can be used with Xen/XenServer (it's the default networking
stack in XenServer 6.0) or KVM.
Open vSwitch became the de-facto OpenFlow switch reference implementation. It's used by
many hardware and software vendors, including VMware, which uses Open vSwitch in the
multi-hypervisor version of NSX.
Summary: OpenFlow is like C++. You can use it to implement all sorts of interesting solutions, but
its just a tool.
OTHER OPTIONS
OpenFlow switches might not support all match conditions specified in the OpenFlow version
they support. For example, most data center switches dont support MPLS or PBB matching.
Furthermore, some switches might implement certain matching actions in software. For
example, early OpenFlow code for HP Procurve switches implemented layer-3 forwarding in
hardware and layer-2 forwarding in software, resulting in significantly reduced forwarding
performance.
OPENFLOW ACTIONS
Every OpenFlow forwarding entry has two components:
Flow match specification, which can use any combination of fields listed in the previous table;
List of actions to be performed on the matched packets.
The initial OpenFlow specification contained the basic actions one needs to implement MAC and IPv4
forwarding, as well as actions one might need to implement NAT or load balancing. Later versions of
the OpenFlow protocol added support for MPLS, IPv6 and Provider Backbone Bridging (PBB).
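As a rough illustration, here's what an action list might look like when expressed with the same Ryu parser classes (the MAC address and VLAN values are made up, and, as the note below points out, hardware support for rewrite actions varies):

# Minimal sketch: an OpenFlow 1.3 action list that rewrites the destination
# MAC address, pushes a VLAN tag, and sends the packet out port 3.
from ryu.ofproto import ofproto_v1_3 as ofp
from ryu.ofproto import ofproto_v1_3_parser as parser

actions = [
    parser.OFPActionSetField(eth_dst="00:00:5e:00:53:01"),  # MAC rewrite
    parser.OFPActionPushVlan(0x8100),                        # push 802.1Q tag
    parser.OFPActionSetField(vlan_vid=(0x1000 | 200)),       # set VLAN 200
    parser.OFPActionOutput(3),                               # output on port 3
]

# Actions are attached to a flow entry through an Apply-Actions instruction
instructions = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
print(instructions)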
OpenFlow switches might not support all actions specified in the
OpenFlow version they support. For example, most switches don't support MAC address, IP address
or TCP/UDP port number rewrites.
Process the packet through a specified group (example: LAG or fast failover); available starting with OpenFlow 1.1.
OTHER OPTIONS
Not surprisingly, the traditional networking vendors quickly moved from an OpenFlow-only approach to
a plethora of hybrid solutions.
To make it even more interesting, at least four different models for OpenFlow deployment have
already emerged:
This model has at least two serious drawbacks even if we ignore the load placed on the controller by
periodic control-plane protocols:
The switches need IP connectivity to the controller for the OpenFlow control session. They can
use an out-of-band network (where OpenFlow switches appear as IP hosts), similar to the QFabric
architecture. They could also use in-band communication sufficiently isolated from the OpenFlow
network to prevent misconfigurations (VLAN 1, for example), in which case they would probably
have to run STP (at least in VLAN 1) to prevent bridging loops.
Fast control loops like BFD are hard to implement with a central controller, more so if you want
to have very fast response time.
NEC seems to be using this model quite successfully (although they probably have a few
extensions), but already encountered inherent limitations: a single controller can control up to ~50
switches and rerouting around failed links takes around 200 msec (depending on the network size).
For more details, watch their Networking Tech Field Day presentation.
NEC has since enhanced the scalability of their controller: a single controller cluster can
manage over 200 switches.
OpenFlow got multipathing support in version 1.1. In late 2013 there are only a few
commercially-available switches supporting OpenFlow 1.3 (vendors decided to skip versions
1.1 and 1.2).
Some controller vendors went down that route and significantly extended OpenFlow 1.1. For
example, Nicira has added support for generic pattern matching, IPv6 and load balancing.
Needless to say, the moment you start using OpenFlow extensions or functionality implemented
locally on the switch, you destroy the mirage of the nirvana described at the beginning of the article:
we're back in the muddy waters of incompatible extensions and hardware compatibility lists. The
specter of Fibre Channel looms large.
This approach is commonly used in academic environments where OpenFlow is running in parallel
with the production network. It's also one of the viable pilot deployment models.
INTEGRATED OPENFLOW
OpenFlow classifiers and forwarding entries are integrated with the traditional control plane. For
example, Juniper's OpenFlow implementation inserts compatible flow entries (those that contain only
destination IP address matching) as ephemeral static routes into the RIB (Routing Information Base).
OpenFlow-configured static routes can also be redistributed into other routing protocols.
From my perspective, this approach makes the most sense: don't rip and replace the existing network
with a totally new control plane, but augment the existing well-known mechanisms with functionality
that's currently hard (or impossible) to implement. You'll obviously lose the vaguely promised benefits
of Software Defined Networking, but I guess that the ability to retain field-proven mechanisms while
adding customized functionality and new SDN applications more than outweighs that.
Before we get started, keep in mind OpenFlow is just a tool that one can use (or not) in
numerous environments. Tom's question is (almost) equivalent to "C programs use string
functions, right?" Some do, some don't; it depends on what you're trying to do.
Edge security policy: authenticate users (or VMs) and deploy per-user ACLs before
connecting a user to the network (example: IPv6 first-hop security);
Programmable SPAN ports: use OpenFlow entries on a single switch to mirror selected traffic
to a SPAN port;
DoS traffic blackholing: use OpenFlow to block DoS traffic as close to the source as possible,
using N-tuples for more selective traffic targeting than the more traditional RTBH approach (a minimal sketch follows this list).
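Here's a minimal sketch of the blackholing idea, assuming a controller that exposes a northbound REST API for flow programming (the URL and JSON schema below follow Ryu's ofctl_rest application; other controllers use different APIs, and all addresses are placeholders):

# Minimal sketch: install a drop entry for offending DoS traffic through a
# controller's REST API (schema modeled on Ryu's ofctl_rest application).
import requests

CONTROLLER = "http://127.0.0.1:8080"     # placeholder controller address

flow = {
    "dpid": 1,                           # switch closest to the attack source
    "priority": 1000,
    "match": {
        "dl_type": 0x0800,               # IPv4
        "nw_src": "198.51.100.7",        # offending source address
        "nw_proto": 17,                  # UDP
        "tp_dst": 53,                    # targeted service (DNS in this example)
    },
    "actions": [],                       # empty action list means drop
}

resp = requests.post(f"{CONTROLLER}/stats/flowentry/add", json=flow, timeout=5)
resp.raise_for_status()
print("blackhole entry installed")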
Using OpenFlow on one or more isolated devices is simple (no interaction with adjacent devices) and
linearly scalable: you can add more devices and controllers as needed because there's no tight
coupling anywhere in the system.
Not surprisingly, developers of these products took different approaches based on their
understanding of networking challenges and limitations of OpenFlow devices.
Some solutions (example: VMware NSX) bypass the complexities of fabric forwarding by
establishing end-to-end something-over-IP tunnels, effectively reducing the fabric to a
single hop.
Path-based forwarding. Install end-to-end path forwarding entries into the fabric and assign user
traffic to paths at the edge nodes (aka Edge and Core OpenFlow). Bonus points if you're smart
enough to pre-compute and install backup paths.
If this looks like a description of MPLS LSPs, FECs and FRR, you're spot on. There are only so many
ways you can solve a problem in a scalable way.
Unfortunately you won't see much PBB or MPLS in OpenFlow products any time soon: they require
OpenFlow 1.3 (or vendor extensions) and hardware support that's often lacking in switches used for
OpenFlow forwarding these days. OpenFlow controller developers are trying to bypass those
problems with creative uses of packet headers (VLAN or MAC rewrite comes to mind), making a
troubleshooter's job much more interesting.
Hop-by-hop forwarding. Install flow-matching N-tuples in every switch along the path. Results in
an architecture that works great in PowerPoint and lab tests, but breaks down in anything remotely
similar to a production network due to scalability problems, primarily FIB update challenges.
OK, here's the quote that ties them together. While describing rack awareness Brad wrote:
What is NOT cool about Rack Awareness at this point is the manual work required to
define it the first time, continually update it, and keep the information accurate. If the
rack switch could auto-magically provide the Name Node with the list of Data Nodes it
has, that would be cool. Or vice versa, if the Data Nodes could auto-magically tell the
Name Node what switch they're connected to, that would be cool too. Even more
interesting would be an OpenFlow network, where the Name Node could query the
OpenFlow controller about a Node's location in the topology.
LLDP was standardized years ago and is available on numerous platforms, including Catalyst
and Nexus switches, and Linux operating system (for example, lldpad is part of the standard Fedora
distribution). Not to mention that every DCB-compliant switch must support LLDP as the DCBX
protocol uses LLDP to advertise DCB settings between adjacent nodes.
The LLDP MIB is standard and allows anyone with SNMP read access to discover the exact local LAN
topology the connected port names, adjacent nodes (and their names), and their management
addresses (IPv4 or IPv6). The management addresses that should be present in LLDP
advertisements can then be used to expand the topology discovery beyond the initial set of nodes
(assuming your switches do include it in LLDP advertisement; for example, NX-OS does but Force10
doesn't).
Building the exact network topology from the LLDP MIB is a trivial exercise. Even a somewhat
reasonable API is available (yeah, having an API returning a network topology graph would be even
cooler). Mapping the Hadoop Data Nodes to ToR switches and Name Nodes can thus be done on
existing gear using existing protocols.
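Just to show how little magic is involved, here is a rough sketch using the pysnmp library; the numeric
OID is lldpRemSysName from the standard LLDP-MIB, and the switch address and community string are
obviously placeholders, not anything from a real deployment.

from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, nextCmd)

LLDP_REM_SYS_NAME = '1.0.8802.1.1.2.1.4.1.1.9'    # lldpRemSysName from LLDP-MIB

def lldp_neighbors(host, community='public'):
    """Walk lldpRemSysName on a single switch and return the neighbor names."""
    neighbors = []
    for err_ind, err_stat, _, var_binds in nextCmd(
            SnmpEngine(), CommunityData(community),
            UdpTransportTarget((host, 161)), ContextData(),
            ObjectType(ObjectIdentity(LLDP_REM_SYS_NAME)),
            lexicographicMode=False):
        if err_ind or err_stat:
            break
        for var_bind in var_binds:
            # The OID instance identifies the local port; the value is the
            # system name advertised by the LLDP neighbor seen on that port.
            neighbors.append(str(var_bind[1]))
    return neighbors

# Repeat the walk for every management address learned so far to expand the
# discovery beyond the initial set of nodes.
print(lldp_neighbors('10.0.0.1'))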
Would OpenFlow bring anything to the table? Actually, no: it also needs packets exchanged between
adjacent devices to discover the topology, and the easiest thing for OpenFlow controllers to use is ...
ta-da ... LLDP ... oops, OFDP, because LLDP just wasn't good enough. The only difference is that in
the traditional network the devices would send LLDP packets themselves, whereas in the OpenFlow
world the controller would use Packet-Out messages of the OpenFlow control session to send LLDP
packets from individual controlled devices and wait for Packet-In messages from other devices to
discover which device received them.
Last but definitely not least, you could use the well-defined SNMP protocol with a number of readily-
available Linux or Windows libraries to read the LLDP results available in the SNMP MIB of the old-
world devices. I'm still waiting to see the high-level SDN/OpenFlow API; everything I've seen so far
is OpenFlow virtualization attempts (multiple controllers accessing the same devices) and
discussions indicating a standard API isn't necessarily a good idea. Really? Haven't you learned
anything from the database world?
So, why did I mention the two posts at the beginning of this article? Because Bob pointed out that
those who cannot remember the past are condemned to fulfill it. At the moment, OpenFlow seems
to fit the bill perfectly.
I found a great overview of IP+ATM solutions in an article published on the University of Washington
web site. This is what the article has to say about Ipsilon's approach (and if you really want to know
the details, read GSMP (RFC 1987) and Ipsilon Flow Management Protocol (RFC 1953)):
An IP switch controller routes like an ordinary router, forwarding packets on a default VC.
However, it also performs flow classification for traffic optimization.
Replace IP switch controller with OpenFlow controller and default VC with switch-to-controller
OpenFlow session.
As expected, Ipsilon's approach had a few scaling issues. From the same article:
The bulk of the criticism, however, relates to Ipsilon's use of virtual circuits. Flows are
associated with application-to-application conversations and each flow gets its very own
VC. Large environments like the Internet with millions of individual flows would exhaust
VC tables.
Not surprisingly, a number of people (myself included) who still remember a bit of networking
history are making the exact same argument about the usage of microflows in OpenFlow environments
... but it seems RFC 1925 (section 2.11) will yet again carry the day.
An hour after publishing this blog post, I realized (reading an article by W.R. Koss) that Ed
Crabbe mentioned Ipsilon being the first attempt at SDN during his OpenFlow Symposium
presentation.
The blog post was written in 2011, when the shortcomings of OpenFlow weren't that well
understood. Three years later (August 2014), all we have is a single production-grade commercial
controller (NEC ProgrammableFlow).
The OpenFlow protocol will definitely enable many copycat vendors to buy merchant silicon, put it
together and start selling their product with little investment in R&D (like the PC motherboard
manufacturers are doing today). I am also positive the silicon manufacturers (like Broadcom) will
If you're old enough to remember the original PCs from IBM, you'll easily recognize the parallels.
IBM documented the PC hardware architecture and BIOS API (you even got BIOS source code), allowing
numerous third-party vendors to build adapter cards (and later PC clones), but all those machines
had to run an operating system ... and most of them used MS-DOS (and later Windows). Almost
three decades later, the vast majority of PCs still run on Microsoft's operating systems.
Some people think that the potential adoption of OpenFlow protocol will magically materialize open-
source software to control the OpenFlow switches, breaking the bonds of proprietary networking
solutions. In reality, the companies that invested heavily in networking software (Cisco, Juniper, HP
and a few others) might be the big winners ... if they figure out fast enough that they should morph
into software-focused companies.
Cisco has clearly realized the winds are changing and started talking about inclusion of OpenFlow in
NX-OS operating system. I would bet their first OpenFlow implementation won't be an OpenFlow-
enabled Nexus switch.
Now imagine none of those APIs would be standardized (various mutually incompatible dialects of
Tcl used by Cisco IOS come to mind): that's the situation we're facing in the SDN land today.
If we accept the analogy of OpenFlow being the x86 instruction set (it's actually more like the p-
code machine from UCSD Pascal days, but let's not go there today), and all we want to do is to write
a simple script that will (for example) redirect the backup-to-tape traffic to secondary path during
peak hours, we need a standard API to get the network topology, create a path across the network,
Are you old enough to remember the video games for early IBM PC? None of them used MS-DOS.
They were embedded software solutions that you had to boot off a floppy disk (remember those?)
and then they took over all the hardware you had. That's exactly what we have in the SDN land
today.
Don't try to tell me I've missed FlowVisor, an OpenFlow controller that allocates slices of actual
hardware to individual OpenFlow controllers. I haven't; but using FlowVisor to solve this problem is
like using Xen (or KVM or ESXi) to boot multiple embedded video games in separate VMs. Not highly
useful for a regular guy trying to steer some traffic around the network (or any one of the other
small things that bother us), is it?
Also, don't tell me each SDN controller has an API. While NEC and startups like Big Switch Networks
are creating something akin to a network operating system that we could use to program our
network (no, I really don't want to deal with the topology discovery and fast failover myself), and
each one of them has an API, no two APIs are even remotely similar.
I still remember the days when there were at least a dozen operating systems running on top of the
8088 processor, and it was mission impossible to write a meaningful application that would run on
more than a few of them without major porting efforts.
The only people truly interested in OpenFlow are the Googles of the world (Nicira is using
OpenFlow purely as an information transfer tool to get MAC-to-IP mappings into their
vSwitches);
Developers figure out all sorts of excellent reasons why their dynamic and creative work couldn't
possibly be hammered into the tight confines of a standard API;
Nobody is interested in creating a Linux-like solution; everyone is striving to achieve the
maximum possible vendor lock-in;
We still don't know what we're looking for.
The reality is probably a random mixture of all four (and a few others), but that doesn't change the
basic facts: until there's a somewhat standard and stable API (like SQL-86) that I could use with
SDN controllers from multiple vendors, I'm better off using Cisco ONE or Junos XML API, otherwise
I'm just trading lock-ins (as ecstatic users of umbrella network management systems would be more
than happy to tell you).
On the other hand, if I stick with Cisco or Juniper (and implement a simple abstraction layer in my
application to work with both APIs) at least I could be pretty positive theyll still be around in a year
or two.
OpenStack is using the same approach in its OVS Neutron plugin, and it seems Open Daylight aims
to reinvent that same wheel, replacing OVS plugin running on the hypervisor host agent with central
controller.
Does that mean that one should use OpenFlow to implement overlay virtual networks? Not really;
OpenFlow is not exactly the best tool for the job.
BTW, even this picture isn't all rosy: Nicira had to implement virtual tunnels to work around the
OpenFlow point-to-point interface model.
Perform dynamic MAC learning in the OpenFlow controller: all frames with unknown source MAC
addresses are punted to the controller, which builds the dynamic MAC address table and
downloads the modified forwarding information to all switches participating in a layer-2 segment.
This is the approach used by NEC's ProgrammableFlow solution (a minimal controller-side sketch
follows this list of options).
Drawback: the controller gets involved in the data plane, which limits the scalability of the solution.
Offload dynamic MAC learning to specialized service nodes, which serve as an intermediary
between the predictive static world of virtual switching, and the dynamic world of VLANs. It
seems NVP used this approach in one of the early releases.
Drawback: The service nodes become an obvious chokepoint; an additional hop through a
service node increases latency.
Give up, half-ditch OpenFlow, and implement either dynamic MAC learning in virtual switches in
parallel with OpenFlow, or reporting of dynamic MAC addresses to the controller using a non-
OpenFlow protocol (to avoid data path punting to the controller). It seems recent versions of
VMware NSX use this approach.
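Here is a minimal single-switch sketch of the first approach, written against the Ryu framework and
OpenFlow 1.3 (my choice of tooling; this is not NEC's implementation): unknown addresses are punted
to the controller, which learns them and programs destination-MAC forwarding entries so subsequent
traffic stays in the data plane. It assumes a table-miss entry pointing to the controller is already
installed, and it skips the cross-switch synchronization a real segment-wide solution would need.

from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3
from ryu.lib.packet import packet, ethernet

class MacLearner(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    def __init__(self, *args, **kwargs):
        super(MacLearner, self).__init__(*args, **kwargs)
        self.mac_table = {}                     # dpid -> {mac: port}

    @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
    def packet_in_handler(self, ev):
        msg, dp = ev.msg, ev.msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser
        in_port = msg.match['in_port']
        eth = packet.Packet(msg.data).get_protocol(ethernet.ethernet)

        # Learn the source MAC, then program a forwarding entry for it so the
        # data plane no longer has to involve the controller for this address.
        table = self.mac_table.setdefault(dp.id, {})
        if table.get(eth.src) != in_port:
            table[eth.src] = in_port
            match = parser.OFPMatch(eth_dst=eth.src)
            actions = [parser.OFPActionOutput(in_port)]
            inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
            dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=10,
                                          match=match, instructions=inst))

        # Unknown destinations are flooded -- the scalability price of MAC learning.
        out_port = table.get(eth.dst, ofp.OFPP_FLOOD)
        data = msg.data if msg.buffer_id == ofp.OFP_NO_BUFFER else None
        dp.send_msg(parser.OFPPacketOut(datapath=dp, buffer_id=msg.buffer_id,
                                        in_port=in_port,
                                        actions=[parser.OFPActionOutput(out_port)],
                                        data=data))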
Even without the ARP proxy functionality, someone has to reply to the ARP queries for the
default gateway IP address.
ARP is a nasty beast in an OpenFlow world: it's a control-plane protocol and thus not
implementable in pure OpenFlow switches. The implementers have (yet again) two choices:
Punt the ARP packets to the controller, which yet again places the OpenFlow controller in the
forwarding path (and limits its scalability);
Solve layer-3 forwarding with a different tool (approach used by VMware NSX and distributed
layer-3 forwarding in OpenStack Icehouse).
IP forwarding table;
ARP table;
VM MAC-to-underlay IP table.
These three tables, combined with local layer-2 and layer-3 forwarding, are all you need. Wouldn't it
be better to keep things simple instead of introducing yet another less-than-perfect abstraction
layer?
IS OPENFLOW USEFUL?
OpenFlow is just a tool that allows you to install PBR-like forwarding entries into networking devices
using a standard protocol that should work across multiple vendors (more about that in another blog
post). From this perspective OpenFlow offers the same functionality as BGP FlowSpec or ForCES,
and a major advantage: it's already implemented in networking gear from numerous vendors.
Where could you use PBR-like functionality? I'm positive you already have a dozen ideas with
various levels of craziness; here are a few more:
OpenFlow has another advantage over BGP FlowSpec: it has the packet-in and packet-out
functionality that allows the controller to communicate with the devices outside of the OpenFlow
network. You could use this functionality to implement new control-plane protocols or (for example)
an interesting layered authentication scheme that is not available in off-the-shelf switches.
Can you build an OpenFlow-based network with existing hardware? Is it possible to build a multi-
vendor network? These questions are answered in the second half of the chapter, which focuses on
vendor-specific implementation details.
MORE INFORMATION
You'll find additional SDN- and OpenFlow-related information on the ipSpace.net web site:
IN THIS CHAPTER:
CONTROL PLANE IN OPENFLOW NETWORKS
IS OPEN VSWITCH CONTROL PLANE IN-BAND OR OUT-OF-BAND?
IMPLEMENTING CONTROL-PLANE PROTOCOLS WITH OPENFLOW
LEGACY PROTOCOLS IN OPENFLOW-BASED NETWORKS
OPENFLOW 1.1 IN HARDWARE: I WAS WRONG
OPTIMIZING OPENFLOW HARDWARE TABLES
OPENFLOW SUPPORT IN DATA CENTER SWITCHES
MULTI-VENDOR OPENFLOW: MYTH OR REALITY?
HYBRID OPENFLOW, THE BROCADE WAY
OPEN DAYLIGHT: INTERNET EXPLORER OR LINUX OF THE SDN WORLD?
OpenFlow is an application-level protocol running on top of TCP (and optionally TLS); the controller
and controlled device are IP hosts using IP connectivity services of some unspecified control plane
network. Does that bring back fond memories of SDH/SONET days? It should.
As always, history is our best teacher: similar architectures commonly used out-of-band control-
plane networks.
On the other hand, an out-of-band control-plane network is safe: we know how to build a robust L3
network with traditional gear, and a controller bug cannot disrupt the control-plane communication.
I would definitely use this approach in a data center environment, where the costs of implementing a
dedicated 1GE control-plane network wouldn't be prohibitively high.
Would the same approach work in WAN/Service Provider environments? Of course it would; after
all, we've been using it forever to manage traditional optical gear. Does it make sense? It definitely
does if you already have an out-of-band network, less so if someone asks you to build a new one to
support their bleeding-edge SDN solution.
That solution would work under optimal circumstances on properly configured switches, but I would
still use an out-of-band control plane in networks with transit OpenFlow-controlled switches (a
transit switch being a switch passing control-plane traffic between controller and another switch).
Open vSwitch supports in-band control plane, but that's not the focus of this post.
If you buy servers with a half dozen interfaces (I wouldn't), then it makes perfect sense to follow the
usual design best practices published by hypervisor vendors, and allocate a pair of interfaces to user
traffic, another pair to management/control plane/vMotion traffic, and a third pair to storage traffic.
Problem solved.
Buying servers with two 10GE uplinks (what I would do) definitely makes your cabling friend happy,
and reduces the overall networking costs, but does result in slightly more interesting hypervisor
configuration.
Best case, you split the 10GE uplinks into multiple virtual uplink NICs (example: Cisco's Adapter
FEX, Broadcom's NIC Embedded Switch, or SR-IOV) and transform the problem into a known
problem (see above) but what if you're stuck with two uplinks?
Figure 4-6: Hypervisor TCP/IP stack running in parallel with the Open vSwitch
Similar to the forwarding model, the OpenFlow controller designers could use numerous
implementation paths.
In the real world, your shiny new network has to communicate with the outside world ... or you could
take the approach most controller vendors did: decide to pretend STP is irrelevant, and ask people
to configure static LAGs because you're also not supporting LACP.
The OpenFlow protocol provides two messages the controllers can use to implement any control-plane
protocol they wish:
The Packet-out message is used by the OpenFlow controller to send packets through any port of
any controlled switch.
The Packet-in message is used to send messages from the switches to the OpenFlow controller.
You could configure the switches to send all unknown packets to the controller, or set up flow
matching entries (based on the controller's MAC/IP address and/or TCP/UDP port numbers) to select
only those packets the controller is truly interested in.
For example, you could write a very simple implementation of STP (similar to what Avaya is doing
on their ERS-series switches when they run MLAG) where the OpenFlow controller would always
pretend to be the root bridge and shut down any ports where inbound BPDUs would indicate
someone else is the root bridge.
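Here is a deliberately simplified sketch of that idea, written against the Ryu controller framework and
OpenFlow 1.3 (my choice, not something from the original post). It behaves more like BPDU guard than
real STP: any BPDU received from the network shuts the offending port. It assumes a flow entry punting
frames sent to the STP multicast address is already installed, and that the port hardware addresses
(required by the port-mod message) were collected from an earlier port-description request.

from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3
from ryu.lib.packet import packet, ethernet

BPDU_MAC = '01:80:c2:00:00:00'          # destination MAC of 802.1D BPDUs

class PseudoStp(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    def __init__(self, *args, **kwargs):
        super(PseudoStp, self).__init__(*args, **kwargs)
        # dpid -> {port_no: hw_addr}; assumed to be filled from an earlier
        # OFPPortDescStatsRequest/Reply exchange (not shown in this sketch)
        self.port_hw_addr = {}

    @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
    def packet_in_handler(self, ev):
        msg, dp = ev.msg, ev.msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser
        eth = packet.Packet(msg.data).get_protocol(ethernet.ethernet)
        if eth is None or eth.dst != BPDU_MAC:
            return                              # not a BPDU; ignore it here

        # Someone else is talking spanning tree on this port: shut it down.
        in_port = msg.match['in_port']
        hw_addr = self.port_hw_addr.get(dp.id, {}).get(in_port, '00:00:00:00:00:00')
        dp.send_msg(parser.OFPPortMod(datapath=dp, port_no=in_port,
                                      hw_addr=hw_addr,
                                      config=ofp.OFPPC_PORT_DOWN,
                                      mask=ofp.OFPPC_PORT_DOWN, advertise=0))
        self.logger.warning('BPDU received on switch %s port %s - port disabled',
                            dp.id, in_port)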
SUMMARY
The OpenFlow protocol allows you to implement any control-plane protocol you wish in the OpenFlow
controller; if a controller does not implement the protocols you need in your data center, it's not due
to a lack of OpenFlow functionality, but due to other factors (fill in the blanks).
If the OpenFlow product you're interested in uses hybrid-mode OpenFlow (where the control plane
resides in the traditional switch software) or uses OpenFlow to program overlay networks (example:
Nicira's NVP), you don't have to worry about its control-plane protocols.
If, however, someone tries to sell you software thats supposed to control your physical switches,
and does not support the usual set of protocols you need to integrate the OpenFlow-controlled
switches with the rest of your network (example: STP, LACP, LLDP on L2 and some routing protocol
on L3), think twice. If you use the OpenFlow-controlled part of the network in an isolated fabric or
small-scale environment, you probably don't care whether the new toy supports STP or OSPF; if you
want to integrate it with the rest of your existing data center network, be very careful.
At least one of the vendors offering OpenFlow controllers that manage physical switches has a
simple answer: use static LAG to connect your existing gear with our OpenFlow-based network
(because our controller doesn't support LACP), use static routes (because we don't run any routing
protocols) and don't create any L2 loops in your network (because we also don't have STP). If you
wonder how reliable that is, you obviously haven't implemented a redundant network with static
routes before.
However, to be a bit more optimistic, the need for legacy protocol support depends primarily on how
the new solution integrates with your network.
Layer-2 gateways included with VMware NSX for multiple hypervisors support STP and
LACP. VM-based gateways included with VMware NSX for vSphere run routing protocols
(BGP, OSPF and IS-IS) and rely on the underlying hypervisor's support of layer-2 control plane
protocols (LACP and LLDP).
Hybrid OpenFlow solutions that only modify the behavior of the user-facing network edge (example:
per-user access control) are also OK. You should closely inspect what the product does and ensure it
doesn't modify the network device behavior you rely upon in your network, but in principle you
should be fine. For example, the XenServer vSwitch Controller modifies just the VM-facing behavior,
but not the behavior configured on uplink ports.
Rip-and-replace OpenFlow-based network fabrics are the truly interesting problem. You'll have to
connect existing hosts to them, so you'd probably want to have LACP support (unless you're a
VMware-only shop), and they'll have to integrate with the rest of the network, so you should ask for
at least:
LACP, if you plan to connect anything but vSphere hosts to the fabric, and you'll probably need
a device to connect the OpenFlow-based part of the network to the outside world;
LLDP or CDP. If nothing else, they simplify troubleshooting, and they are implemented on almost
everything including vSphere vSwitch.
STP, unless the OpenFlow controller implements split-horizon bridging like vSphere's vSwitch, but
even then we need basic things like BPDU guard.
Call me a grumpy old man, but I wouldn't touch an OpenFlow controller that doesn't support the
above-mentioned protocols. Worst case, if I were forced to implement a network using such a
controller, I would make sure it's totally isolated from the rest of my network. Even then a single
point of failure wouldn't make much sense, so I would need two firewalls or routers, and static
routing in redundant scenarios breaks sooner or later. You get the picture.
To summarize: dynamic link status and routing protocols were created for a reason. Don't allow
glitzy new-age solutions to daze you, or you just might experience a major headache down the road.
The trick lies in the NP-4 network processors from EZchip. These amazing beasts are powerful
enough to handle the linked tables required by OpenFlow 1.1; the researchers just had to
implement the OpenFlow API and compile OpenFlow TCAM structures into NP-4 microcode.
I have to admit I'm impressed (and as some people know, that's not an easy task). It doesn't
matter whether the solution can handle full 100 Gbps or what the pps figures are; they got very far
very soon using off-the-shelf hardware, so it shouldn't be impossibly hard to repeat the performance
and launch a commercial product. The only question is the price of the NP-4 chipset (including the
associated TCAM they were using): can someone build a reasonably-priced switch out of that
hardware?
That approach was good enough to get you a tick-in-the-box on RFP responses, but it fails miserably
when you try to get OpenFlow working in a reasonably sized network. On the other hand, many
problems people try to solve with OpenFlow, like data center fabrics, involve simple destination-only
L2 or L3 switching.
Problems that can be solved with destination-only L2- or L3 switching are so similar to what
we're doing with traditional routing protocols that I keep wondering whether it makes sense
to reinvent that particular well-working wheel, but let's not go there.
The switching hardware vendors realized in the last few months what the OpenFlow developers were
doing and started implementing forwarding optimizations: they would install OpenFlow entries that
require 12-tuple matching in TCAM, and entries that specify only destination MAC address or
destination IP prefix in L2- and L3 switching structures (usually hash tables for L2 switching and
The vendors using this approach include Arista (L2), Cisco (L2), and Dell Force 10 (L2 and L3). HP is
using both MAC table and TCAM in its 5900 switch, but presents them as two separate tables to the
OpenFlow controller (at least that was my understanding of their documentation please do correct
me if I got it wrong), pushing the optimization challenge back to the controller.
All the information in this blog post comes from publicly available vendor documentation
(configuration guides, command references, release notes). NEC is the only vendor
mentioned in this blog post that does not have public documentation, so its impossible to
figure out (from the outside) what functionality their switches support.
Summary: It's nigh impossible to implement anything but destination-only L2+L3 switching at
scale using existing hardware (the latest chipsets from Intel or Broadcom aren't much better) and
I wouldn't want to be a controller vendor dealing with idiosyncrasies of all the hardware out there
all you can do consistently across most hardware switches is forward packets (without rewrites),
drop packets, or set VLAN tags.
Does that mean weve entered the era of multi-vendor OpenFlow networking? Not so fast.
You see, building real-life networks with fast feedback loops and fast failure reroutes is hard. It took
NEC years to get a stable well-performing implementation, and they had to implement numerous
OpenFlow 1.0 extensions to get all the features they needed. For example, they circumvented the
flow update rate challenges by implementing a very smart architecture effectively equivalent to the
Edge+Core OpenFlow ideas.
In a NEC-only ProgrammableFlow network, the edge switches (be they PF5240 GE switches or
PF1000 virtual switches in Hyper-V environment) do all the hard work, while the core switches do
simple path forwarding. Rerouting around a core link failure is thus just a matter of path rerouting,
not flow rerouting, reducing the number of entries that have to be rerouted by several orders of
magnitude.
In mixed-vendor environments, the ProgrammableFlow controller obviously cannot use all the smarts
of the PF5240 switches; it has to fall back to the least common denominator (vanilla OpenFlow 1.0)
and install granular flows in every single switch along the path, significantly increasing the time it
takes to install new flows after a core link failure.
For the moment, the best advice I can give you is: if you want to have a working OpenFlow data
center fabric, stick with an NEC-only solution.
The traditional hybrid OpenFlow model (what Keith called hybrid switch) is well known (and
supported by multiple vendors): an OpenFlow-capable switch has two forwarding tables (or FIBs), a
regular one (built from source MAC address gleaning or routing protocol information) and an
OpenFlow-controlled one. Some ports of the switch use one of the tables, other ports the other.
Effectively, a hardware switch supporting hybrid switch OpenFlow is split into two independent
switches that operate in a ships-in-the-night fashion.
More interesting is the second hybrid mode Brocade supports: the hybrid port mode, where the
OpenFlow FIB augments the traditional FIB. Brocade's switches using the hybrid port approach can
operate in protected or unprotected mode:
Protected hybrid port mode uses OpenFlow FIB for certain VLANs or packets matching a packet
filter (ACL). This mode allows you to run OpenFlow in parallel (ships-in-the-night) with the
The set of applications that one can build with hybrid OpenFlow is well known: from policy-
based routing and traffic engineering to bandwidth-on-demand. However, Brocade MLX has one
more trick up its sleeve: it supports packet replication actions that can be used to implement
behavior similar to IP Multicast or SPAN port functionality. You can use that feature in environments
that need reliable packet delivery over UDP to increase the chance that at least a single copy of the
packet will reach the destination.
I like the hybrid approach Brocade took (it's quite similar to what Juniper is doing with its integrated
OpenFlow) and the interesting new features (like the packet replication), but the big question
remains unanswered: where are the applications (aka OpenFlow controllers)? At the moment,
everyone (Brocade included) is partnering with NEC or demoing their gear with public-domain
controllers. Is this really the best the traditional networking vendors can do? I sincerely hope not.
Are you old enough to remember how Microsoft killed the browser market? After the World Wide
Web exploded (and caught Microsoft totally unprepared), there was a blooming browser market
(with Netscape being the absolute market leader). Microsoft couldnt compete in that market with an
immature product (Internet Explorer) and decided it was best to destroy the market. They made
Internet Explorer freely available and the rest is history: after the free product won the browser
wars (it's hard to beat free and good enough) it took years for reasonable alternatives to emerge.
Not surprisingly, browser innovation almost stopped until Internet Explorer lost its dominant market
position.
Even if you dont remember Netscape Navigator, youve probably heard of Linux. Have you ever
wondered how you could get a high-quality open-source operating system for free? Check the list of
top Linux contributors (page 9-11 of the Linux Kernel Development report) Red Hat, Intel, Novell
and IBM. You might wonder why Intel and IBM invest in Linux. It's simple: the less users have to
So what will Daylight be? Another Internet Explorer (killing the OpenFlow controller market, Big
Switch in particular) or another Linux (a good product ensuring OpenFlow believers continue
spending money on hardware, not software)? I'm hoping we'll get a robust networking Linux, but
your guess is as good as mine.
MORE INFORMATION
You'll find additional SDN- and OpenFlow-related information on the ipSpace.net web site:
This chapter describes numerous challenges every OpenFlow controller implementation has to
overcome to work well in large-scale environments. Use it as a (partial) checklist when evaluating
OpenFlow controller products and solutions.
IN THIS CHAPTER:
OPENFLOW FABRIC CONTROLLERS ARE LIGHT-YEARS AWAY FROM WIRELESS ONES
OPENFLOW AND FERMI ESTIMATES
50 SHADES OF STATEFULNESS
FLOW TABLE EXPLOSION WITH OPENFLOW 1.0 (AND WHY WE NEED OPENFLOW
1.3)
FLOW-BASED FORWARDING DOESN'T WORK WELL IN VIRTUAL SWITCHES
PROCESS, FAST AND CEF SWITCHING AND PACKET PUNTING
CONTROLLER-BASED PACKET FORWARDING IN OPENFLOW NETWORKS
CONTROL-PLANE POLICING IN OPENFLOW NETWORKS
PREFIX-INDEPENDENT CONVERGENCE (PIC): FIXING THE FIB BOTTLENECK
FIB UPDATE CHALLENGES IN OPENFLOW NETWORKS
While OpenFlow-based data center fabrics and wireless controller-based networks look very similar
on a high-level PowerPoint diagram, in reality they're light-years apart. Here are just a few
dissimilarities that make OpenFlow-based fabrics so much more complex than the wireless
controllers.
TOPOLOGY MANAGEMENT
Wireless controllers work with the devices on the network edge. A typical wireless access point has
two interfaces: a wireless interface and an Ethernet uplink, and the wireless controller isnt
managing the Ethernet interface or any control-plane protocols that interface might have to run. The
wireless access point communicates with the controller through an IP tunnel and expects someone
Data center fabrics are built from high-speed switches with tens of 10/40GE ports, and the
OpenFlow controller must manage topology discovery, topology calculation, flow placement, failure
detection and fast rerouting. There are zillions of things you have to do in data center fabrics that
you never see in a controller-based wireless network.
TRAFFIC FLOW
In traditional wireless networks all traffic flows through the controller (there are some exceptions,
but lets ignore them for the moment). The hub-and-spoke tunnels between the controller and the
individual access points carry all the user traffic, and the controller is making all the smart forwarding
decisions.
AMOUNT OF TRAFFIC
Wireless access points handle megabits of traffic, making a hub-and-spoke controller-based
forwarding a viable alternative.
FORWARDING INFORMATION
In a traditional controller-based wireless network, the access point forwarding is totally stupid: the
access points forward the data between directly connected clients (if allowed to do so) or send the
data received from them into the IP tunnel established with the controller (and vice versa). There's
no forwarding state to distribute; all an access point needs to know are the MAC addresses of the
wireless clients.
In an OpenFlow-based fabric the controller must distribute as much forwarding, filtering and
rewriting (example: decrease TTL) information as possible to the OpenFlow-enabled switches to
minimize the amount of traffic flowing through the controller.
Furthermore, smart OpenFlow controllers build forwarding information in a way that allows the
switches to cope with the link failures (the controller has to install backup entries with lower
matching priority); you wouldn't want to have an overloaded controller and burnt-out switch CPU
every time a link goes down, network topology is lost, and the switch (in deep panic) forwards all
the traffic to the controller.
The functionality of a good OpenFlow controller that proactively pre-programs backup forwarding
entries (example: NEC ProgrammableFlow) is very similar to MPLS Traffic Engineering with Fast
Reroute; you cannot expect its complexity to be significantly lower than that.
The other near-real-time wireless event is user authentication, which often takes seconds (or my
wireless network is severely misconfigured). Yet again, nothing critical; the controller can take its
time.
In data center fabrics, you have to react to a failure in milliseconds and reprogram the forwarding
entries on tens of switches (unless you know what you're doing and have already installed the pre-
computed backup entries; see above).
OpenFlow controllers that implement flow-based forwarding (flow entries are downloaded into the
switches for each individual TCP/UDP session a patently bad idea if I ever saw one) are designed
to handle millions of flow setups per second (not that the physical switches could take that load).
Comparing the two is misleading and hides the real scope of the problem; no wonder some people
would love you to believe otherwise because that makes selling the controller-based fabrics easier.
In reality, an OpenFlow controller managing a physical data center fabric is a complex piece of real-
time software, as anyone who tried to build a high-end switch or router has learned the hard way.
Every time someone tries to tell you what your problem is, and how their wonderful new gizmo will
solve it, it's time for another Fermi estimate.
Data center bandwidth. A few weeks ago a clueless individual working for a major networking
vendor wrote a blog post (which unfortunately got pulled before I could link to it) explaining how
network virtualization differs from server virtualization because we dont have enough bandwidth in
the data center. A quick estimate shows a few ToR switches have all the bandwidth you usually need
(you might need more due to traffic bursts and number of server ports you have to provide, but
thats a different story).
VM mobility for disaster avoidance needs. A back-of-the-napkin calculation shows you can't
evacuate more than half a rack per hour over a 10GE link. The response I usually get when I prod
networking engineers into doing the calculation: OMG, that's just hilarious. Why would anyone want
to do that?
Scenario: web application(s) hosted in a data center with 10GE WAN uplink.
Questions:
How many new sessions are established per second (how many OpenFlow flows does the
controller have to install in the hardware)?
How many parallel sessions will there be (how many OpenFlow flows does the hardware have to
support)?
Using facts #3 and #4 we can estimate the total number of sessions needed for a single web page.
It's anywhere between 20 and 120; let's be conservative and use 20.
Using fact #1 and the previous result, we can estimate the amount of data transferred over a typical
HTTP session: 50 KB.
Assuming a typical web page takes 5 seconds to load, a typical web user receives 200 KB/second
(1.6 Mbps) over 20 sessions or 10 KB/second (80 kbps) per session. Seems low, but do remember that most
of the time the browser (or the server) waits due to RTT latency and TCP slow start issues.
Always do a reality check. Is this number realistic? Load balancing vendors support way more
connections per second (cps) @ 10 Gbps speeds. F5 BIG-IP 4000s claims 150K cps @ 10 Gbps, and
VMware claims its NSX Edge Services Router (improved vShield Edge) will support 30K cps @ 4
Gbps. It seems my guesstimate is on the lower end of reality (if you have real-life numbers, please
do share them in comments!).
Modern web browsers use persistent HTTP sessions. Browsers want to keep sessions established as
long as possible; web servers serving high-volume content commonly drop them after ~15 seconds
to reduce the server load (Apache is notoriously bad at handling a very high number of concurrent
sessions). 25,000 cps x 15 seconds = 375,000 flow records.
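The same estimate, spelled out as a few lines of Python so you can plug in your own numbers; the
constants below are simply the assumptions used in the text above, not measured values.

# Back-of-the-envelope version of the estimate above.
uplink_bps        = 10e9       # 10GE WAN uplink
user_bps          = 1.6e6      # ~200 KB/s per active web user
sessions_per_page = 20         # conservative sessions-per-page estimate
page_load_time    = 5          # seconds per page
session_lifetime  = 15         # seconds before the web server drops the session

concurrent_users  = uplink_bps / user_bps                                   # ~6,250
new_flows_per_sec = concurrent_users * sessions_per_page / page_load_time   # ~25,000
concurrent_flows  = new_flows_per_sec * session_lifetime                    # ~375,000

print(int(new_flows_per_sec), int(concurrent_flows))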
Trident-2-based switches can handle 100K+ L4 OpenFlow entries (at least Big Switch claimed so
when we met @ NFD6). That's definitely on the low end of the required number of sessions at 10
Gbps; do keep in mind that the total throughput of a typical Trident-2 switch is above 1 Tbps, or
two orders of magnitude higher. Enterasys switches support 64M concurrent flows @ 1 Tbps, which
seems to be enough.
The flow setup rate on Trident-2-based switches is supposedly still in low thousands, or an order of
magnitude too low to support a single 10 Gbps link (the switches based on this chipset usually have
64 10GE interfaces).
Now is the time for someone to invoke the ultimate Moore's Law spell and claim that the hardware
will support whatever number of flow entries in the not-so-distant future. Good luck with that; I'll settle
for an Intel Xeon server that can be pushed to 25 Mpps. OpenFlow has its uses, but large-scale
stateful services is obviously not one of them.
50 SHADES OF STATEFULNESS
A while ago Greg Ferro wrote a great article describing integration of overlay and physical networks
in which he wrote that an overlay network tunnel has no state in the physical network, triggering
an almost-immediate reaction from Marten Terpstra (of RIPE fame, now @ Plexxi) arguing that the
network (at least the first ToR switch) knows the MAC and IP address of the hypervisor host and thus
has at least some state associated with the tunnel.
Marten is correct from a purely scholastic perspective (using his argument, the network keeps some
state about TCP sessions as well), but what really matters is how much state is kept, which
device keeps it, how it's created and how often it changes.
The state granularity should get ever coarser as you go deeper into the network core: edge
switches keep MAC address tables and ARP/ND caches of adjacent end hosts, core routers know
about IP subnets, routers in the public Internet know about the publicly advertised prefixes (including
every prefix Bell South ever assigned to one of its single-homed customers), while the high-speed
MPLS routers know about BGP next hops and other forwarding equivalence classes (FECs).
Furthermore, as much state as possible should be stored in low-speed devices using software-based
forwarding. It's pretty simple to store a million flows in a software-based Open vSwitch (updating
them is a different story) and mission impossible to store 10,000 5-tuple flows in the Trident 2 chipset
used by most ToR switches.
Data-plane-driven state is particularly problematic for devices with hardware forwarding: packets
that change state (example: TCP SYN packets creating a new NAT translation) might have to be
punted to the CPU.
Finally, there's the soft state: cases where the protocol designers needed state in the network,
but didn't want to create a proper protocol to maintain it, so the end devices get burdened with
periodic state refresh messages, and the transit devices spend CPU cycles refreshing the state. RSVP
is a typical example, and everyone running large-scale MPLS/TE networks simply loves the periodic
refresh messages sent by tunnel head-ends: they keep the core routers processing them cozily
warm.
Not surprisingly, RFC 3439 (Some Internet Architectural Guidelines and Philosophy) gives
you similar advice, although in a way more eloquent form.
First, let's put the 4000 flows number in perspective. It's definitely a bit better than what current
commodity switches can do (for vendors trying to keep mum about their OpenFlow limitations,
check their ACL sizes; flow entries would use the same TCAM), but NEC had 64,000+ flows on the
PF5240 years ago and Enterasys has 64 million flows per box with their CoreFlow2 technology.
Judge for yourself whether 4000 flows is such a major step forward.
Now let's focus on whether 4000 flows is enough. As always, the answer depends on the use case,
network size and implementation details. This blog post will focus on the last part.
The OpenFlow-based network trying to get feature parity with low-cost traditional ToR switches
should support
We'll focus on a single layer-2 segment (you really don't want to get me started on the complexities
of scalable OpenFlow-based layer-3 forwarding) implemented on a single hardware switch. Our
segment will have two web servers (port 1 and 2), a MySQL server (port 3), and a default gateway
on port 4.
The default gateway could be a firewall, a router, or a load balancer; it really doesn't
matter if we stay focused on layer-2 forwarding.
Smart switches wouldn't store the MAC-only flow rules in TCAM; they would use other
forwarding structures available in the switch, like MAC hash tables.
The number of TCAM entries needed to support multi-tenant layer-2 forwarding has exploded:
By now you've probably realized what happens when you try to combine the input ACL with other
forwarding rules. The OpenFlow controller has to generate a Cartesian product of all three
requirements: the switch needs a flow entry for every possible combination of input port, ACL entry
and destination MAC address.
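To see how quickly that Cartesian product blows up, here is a trivial back-of-the-envelope calculation;
the port, ACL and MAC counts are my own illustrative assumptions, not numbers from the example
above.

ports       = 48      # edge ports (tenant classification)
acl_entries = 50      # entries in the input ACL
macs        = 100     # destination MAC addresses in the segment

# One flat OpenFlow 1.0 table has to enumerate every combination.
single_table_entries = ports * acl_entries * macs      # 240,000 TCAM entries

# With separate classification and forwarding tables (see below) the
# requirements merely add up instead of multiplying.
multi_table_entries = ports * acl_entries + macs        # 2,500 entries

print(single_table_entries, multi_table_entries)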
OpenFlow 1.1 (and later versions) has a concept of multiple tables: independent lookup tables that can be
chained in any way you wish (further complicating the life of hardware vendors).
This is how you could implement our requirements with switches supporting OpenFlow 1.3:
Table #1: ACL and tenant classification table. This table would match input ports (for tenant
classification) and ACL entries, drop the packets not matched by input ACLs, and redirect the
forwarding logic to correct per-tenant table.
Tables #2 .. #n: per-tenant forwarding tables, matching destination MAC addresses and
specifying output ports.
The first table could be further optimized in networks using the same (overly long) access
list on numerous ports. That decision could also be made dynamically by the OpenFlow
controller.
A typical switch would probably have to implement the first table with a TCAM. All the other tables
could use the regular MAC forwarding logic (MAC forwarding table is usually orders of magnitude
bigger than TCAM). Scalability problem solved.
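Here is what the two-stage pipeline might look like when programmed through the Ryu framework with
OpenFlow 1.3 (my choice of tooling); the port numbers, table IDs, addresses and the single "permit
TCP/80" ACL entry are illustrative assumptions, not the full ruleset discussed above.

def install_pipeline(datapath):
    ofp, parser = datapath.ofproto, datapath.ofproto_parser

    # Table 0: packets from port 1 that pass the "ACL" continue to table 1.
    acl_match = parser.OFPMatch(in_port=1, eth_type=0x0800, ip_proto=6, tcp_dst=80)
    goto = parser.OFPInstructionGotoTable(1)
    datapath.send_msg(parser.OFPFlowMod(datapath=datapath, table_id=0,
                                        priority=100, match=acl_match,
                                        instructions=[goto]))

    # Table 0: everything else from that port is dropped (no instructions).
    drop_match = parser.OFPMatch(in_port=1)
    datapath.send_msg(parser.OFPFlowMod(datapath=datapath, table_id=0,
                                        priority=10, match=drop_match,
                                        instructions=[]))

    # Table 1: plain destination-MAC forwarding, which the switch could keep
    # in its (much larger) MAC hash table instead of TCAM.
    fwd_match = parser.OFPMatch(eth_dst='00:00:00:00:00:03')
    actions = [parser.OFPActionOutput(3)]
    apply_acts = parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)
    datapath.send_msg(parser.OFPFlowMod(datapath=datapath, table_id=1,
                                        priority=100, match=fwd_match,
                                        instructions=[apply_acts]))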
One would expect virtual switches to fare better. That doesn't seem to be the case.
The user-mode daemon would then perform packet lookup (using OpenFlow forwarding entries or
any other forwarding algorithm) and install a microflow entry for the newly discovered flow in the
kernel module.
Third parties (example: Midokura Midonet) use Open vSwitch kernel module in combination
with their own user-mode agent to implement non-OpenFlow forwarding architectures.
If you're old enough to remember the Catalyst 5000, you're probably getting unpleasant flashbacks
of NetFlow switching ... but the problems we experienced with that solution must have been caused
by poor hardware and an underperforming CPU, right? Well, it turns out virtual switches don't fare
much better.
Digging deep into the bowels of Open vSwitch reveals an interesting behavior: flow eviction. Once
the kernel module hits the maximum number of microflows, it starts throwing out old flows. Makes
perfect sense; after all, that's how every caching system works ... until you realize the default limit
is 2500 microflows, which is barely good enough for a single web server and definitely orders of
magnitude too low for a hypervisor hosting 50 or 100 virtual machines.
I wasn't able to figure out what the underlying root cause is, but I suspect it has to do with
per-flow accounting: flow counters have to be transferred from the kernel module to the user-mode
daemon periodically. Copying hundreds of thousands of flow counters over a user-to-kernel socket
at short intervals might result in somewhat noticeable CPU utilization.
Not surprisingly, no other virtual switch uses microflow-based forwarding. VMware vSwitch, Cisco's
Nexus 1000V and IBM's 5000V make forwarding decisions based on destination MAC addresses,
Hyper-V and Contrail based on destination IP addresses, and even VMware NSX for vSphere uses
distributed vSwitch and in-kernel layer-3 forwarding module.
Once the input queue of a packet forwarding process becomes non-empty, the operating system
schedules it. When there are no higher-priority processes ready to be run, the operating system
performs a context switch to the packet forwarding process.
When the packet forwarding process wakes up, it reads the next entry from its input queue,
performs destination address lookup and numerous other functions that might be configured on
input and output interfaces (NAT, ACL ...), and sends the packet to the output interface queue.
Not surprisingly, this mechanism is exceedingly slow ... and Cisco IOS is not the only operating
system struggling with that just ask anyone who tried to run high-speed VPN tunnels implemented
in Linux user mode processes on SOHO routers.
There's not much you can do to speed up ACLs (which have to be read sequentially) and NAT is
usually not a big deal (assuming the programmers were smart enough to use hash tables).
Destination address lookup might be a real problem, more so if you have to do it numerous times
(example: destination is a BGP route with BGP next hop based on static route with next hop learnt
from OSPF). Welcome to fast switching.
Fast switching is a reactive cache-based IP forwarding mechanism. The address lookup within the
interrupt handler uses a cache of destinations to find the IP next hop, outgoing interface, and
outbound layer-2 header. If the destination is not found in the fast switching cache, the packet is
punted to the IP(v6) Input process, which eventually performs full-blown destination address lookup
(including ARP/ND resolution) and stores the results in the fast switching cache.
Fast switching worked great two decades ago (there were even hardware implementations of fast
switching) ... until the bad guys started spraying the Internet with vulnerability scans. No caching
code works well with miss rates approaching 100% (because every packet is sent to a different
destination) and very high cache churn (because nobody designed the cache to have 100,000 or
more entries).
When faced with a simple host scanning activity, routers using fast switching in combination with
high number of IP routes (read: Internet core routers) experienced severe brownouts because most
CEF switching (or Cisco Express Forwarding) is a proactive, deterministic IP forwarding mechanism.
The routing table (RIB), as computed by routing protocols, is copied into the forwarding table (FIB), where
it's combined with adjacency information (ARP or ND table) to form a deterministic lookup table.
When a router uses CEF switching, there's (almost) no need to punt packets sent to unknown
destinations to IP Input process; if a destination is not in the FIB, it does not exist.
There are still cases where CEF switching cannot do its job. For example, packets sent to IP
addresses on directly connected interfaces cannot be sent to destination hosts until the router
performs ARP/ND MAC address resolution; these packets have to be sent to the IP Input process.
The directly connected prefixes are thus entered as glean adjacencies in the FIB, and as the router
learns MAC address of the target host (through ARP or ND reply), it creates a dynamic host route in
the FIB pointing to the adjacency entry for the newly-discovered directly-connected host.
Actually, you wouldn't want to send too many packets to the IP Input process; it's better to create
the host route in the FIB (pointing to the bit bucket, /dev/null or something equivalent) even before
the ARP/ND reply is received to ensure subsequent packets sent to the same destination are
dropped, not punted (a behavior nicely exploitable by an ND exhaustion attack).
It's pretty obvious that the CEF table must stay current. For example, if the adjacency information is
lost (due to ARP/ND aging), the packets sent to that destination are yet again punted to process
switching. No wonder the router periodically refreshes ARP entries to ensure they never expire.
Though there is separate control plane and separate data plane, it appears that there is
crossover from one to the other. Consider the scenario when flow tables are not
programmed and so the packets will be punted by the ingress switch to PFC. The PFC will
then forward these packets to the egress switch so that the initial packets are not
dropped. So in some sense: we are seeing packet traversing the boundaries of typical
data-plane and control-plane and vice-versa.
He's absolutely right, and if the above description reminds you of fast and process switching you're
spot on. There really is nothing new under the sun.
OpenFlow controllers use one of the following two approaches to switch programming (more details
@ NetworkStatic):
Proactive flow table setup, where the controller downloads flow entries into the switches based
on user configuration (ex: ports, VLANs, subnets, ACLs) and network topology;
Even though I write about flow tables, don't confuse them with the per-flow forwarding that Doug
Gourlay loves almost as much as I do. A flow entry might match solely on destination MAC address,
making flow tables equivalent to MAC address tables, or it might match the destination IP address
with the longest IP prefix in the flow table, making the flow table equivalent to routing table or FIB.
The controller must know the topology of the network and all the endpoint addresses (MAC
addresses, IP addresses or IP subnets) for the proactive (predictive?) flow setup to work. If you had
an OpenFlow controller emulating an OSPF or BGP router, it would be easy to use proactive flow
setup; after all, the IP routes never change based on the application traffic observed by the
switches.
However, most vendors' marketing departments (with a few notable exceptions) think their gear
needs to support every bridging-abusing stupidity ever invented, from load balancing schemes that
work best with hubs to floating IP or MAC addresses used to implement high-availability solutions.
End result: the network has to support dynamic MAC learning, which makes OpenFlow-based
networks reactive: nobody can predict when and where a new MAC address will appear (and it's not
guaranteed that the first packet sent from the new MAC address will be an ARP packet), so the
Some bridges (lovingly called layer-2 switches) don't punt packets with unknown MAC addresses to
the CPU, but perform dynamic MAC address learning and unknown unicast flooding in hardware...
but that's not how OpenFlow is supposed to work.
Within a single device the software punts packets from hardware (or interrupt) switching to
CPU/process switching; in a controller-based network the switches punt packets to the controller. Plus
ça change, plus c'est la même chose.
The weakest link in today's OpenFlow implementations (like NEC's ProgrammableFlow) is not the
controller, but the dismal CPU used in the hardware switches. The controller could handle millions of
flow setups per second (that's the rate claimed by Floodlight developers); the switches usually burn
out at thousands of flow setups per second.
The CoPP function thus has to be implemented in the OpenFlow switches (like it's implemented in
linecard hardware in traditional switches), and that's where the problems start: OpenFlow didn't
have usable rate-limiting functionality until version 1.3, which added meters.
OpenFlow meters are a really cool concept: they have multiple bands, and you can apply either
DSCP remarking or packet dropping at each band. That would allow an OpenFlow controller to
closely mimic the CoPP functionality and apply different rate limits to different types of control- or
punted traffic.
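For illustration, here is a minimal Ryu/OpenFlow 1.3 sketch that attaches a meter to punted ARP
traffic, roughly mimicking one line of a CoPP policy; the rates, IDs and priorities are arbitrary
examples, not recommended values.

def install_arp_copp(datapath):
    ofp, parser = datapath.ofproto, datapath.ofproto_parser

    # Meter 1: drop ARP punts beyond roughly 512 kbps.
    bands = [parser.OFPMeterBandDrop(rate=512, burst_size=64)]
    datapath.send_msg(parser.OFPMeterMod(datapath=datapath,
                                         command=ofp.OFPMC_ADD,
                                         flags=ofp.OFPMF_KBPS,
                                         meter_id=1, bands=bands))

    # Punt ARP packets to the controller, but only after they pass meter 1.
    match = parser.OFPMatch(eth_type=0x0806)
    actions = [parser.OFPActionOutput(ofp.OFPP_CONTROLLER, ofp.OFPCML_NO_BUFFER)]
    inst = [parser.OFPInstructionMeter(1),
            parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
    datapath.send_msg(parser.OFPFlowMod(datapath=datapath, priority=1000,
                                        match=match, instructions=inst))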
If you want to know more details, I would strongly suggest you browse through the IP Fast Reroute
Applicability presentation Pierre Francois had @ EuroNOG 2011. To summarize what he told us:
It's relatively easy to fine-tune OSPF or IS-IS and get convergence times in tens of milliseconds.
SPF runs reasonably fast on modern processors, more so with incremental SPF optimizations.
A platform using software-based switching can use the SPF results immediately (thus there's no
real need for LFA on a Cisco 7200).
The true bottleneck is the process of updating distributed forwarding tables (FIBs) from the IP
routing table (RIB) on platforms that use hardware switching. That operation can take a
relatively long time if you have to update many prefixes.
PIC was first implemented for BGP (you can find more details, including interesting discussions of
FIB architectures, in another presentation Pierre Francois had @ EuroNOG), which usually carries
hundreds of thousands of prefixes that point to a few tens of different next hops. It seems some
Service Providers carry way too many routes in OSPF or IS-IS, so it made sense to implement LFA
for those routing protocols as well.
In its simplest form, BGP PIC goes a bit beyond existing EBGP/IBGP multipathing and copies
backup path information into RIB and FIB. Distributing alternate paths throughout the
network requires numerous additional tweaks, from modified BGP path propagation rules to
modified BGP route reflector behavior.
NEC, the only company I'm aware of that has production-grade OpenFlow deployments and is willing
to talk about them, admitted as much in their Networking Tech Field Day 2 presentation (watch the
ProgrammableFlow Architecture and Use Cases video around 12:00). Their particular
controller/switch combo can set up 600-1000 flows per switch per second (which is still way better
than what researchers using HP switches found and documented in the DevoFlow paper they found
the switches can set up ~275 flows per second).
Now imagine a core of a simple L2 network built from tens of switches and connecting hundreds of
servers and thousands of VMs. Using traditional L2 forwarding techniques, each switch would have
to know the MAC address of each VM ... and the core switches would have to update thousands of
entries after a link failure, resulting in multi-second convergence time. Obviously OpenFlow-based
networks need prefix-independent convergence (PIC) as badly as anyone else.
OpenFlow 1.0 could use flow matching priorities to implement primary/backup forwarding entries
and OpenFlow 1.1 provides a fast failover mechanism in its group tables that could be used for
prefix-independent convergence, but it's questionable how far you can get with existing hardware
devices, and PIC doesn't work in all topologies anyway.
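A minimal sketch of the fast-failover idea (Ryu, OpenFlow 1.3 in this case): the switch itself falls
back to the second bucket when the watched primary port goes down, without waiting for the controller.
The port numbers and destination prefix are illustrative assumptions.

def install_ff_group(datapath, primary_port=1, backup_port=2):
    ofp, parser = datapath.ofproto, datapath.ofproto_parser

    # Fast-failover group: the first live watched port wins.
    buckets = [
        parser.OFPBucket(watch_port=primary_port, watch_group=ofp.OFPG_ANY,
                         actions=[parser.OFPActionOutput(primary_port)]),
        parser.OFPBucket(watch_port=backup_port, watch_group=ofp.OFPG_ANY,
                         actions=[parser.OFPActionOutput(backup_port)]),
    ]
    datapath.send_msg(parser.OFPGroupMod(datapath=datapath,
                                         command=ofp.OFPGC_ADD,
                                         type_=ofp.OFPGT_FF,
                                         group_id=1, buckets=buckets))

    # Point a destination prefix at the group instead of at a fixed output port;
    # every prefix using the group converges when the group flips, not per-prefix.
    match = parser.OFPMatch(eth_type=0x0800, ipv4_dst=('10.0.0.0', '255.0.0.0'))
    inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS,
                                         [parser.OFPActionGroup(1)])]
    datapath.send_msg(parser.OFPFlowMod(datapath=datapath, priority=100,
                                        match=match, instructions=inst))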
Just in case you're wondering how existing L2 networks work at all: the data plane in high-
speed switches performs dynamic MAC learning and populates the forwarding table in
hardware; the communication between the control and the data plane is limited to the bare
minimum (which is another reason why implementing OpenFlow agents on existing switches
is like attaching a jetpack to a camel).
All the traffic that expects the same forwarding behavior gets the same label;
The intermediate nodes no longer have to inspect the individual packet/frame headers; they
forward the traffic solely based on the FEC indicated by the label.
The grouping/labeling operation thus greatly reduces the forwarding state in the core nodes (you
can call them P-routers, backbone bridges, or whatever other terminology you prefer) and improves
Figure 5-2: MPLS forwarding diagram from the Enterprise MPLS/VPN Deployment webinar
The core network convergence is improved due to reduced state, not due to the pre-computed
alternate paths that Prefix-Independent Convergence or MPLS Fast Reroute use.
All sorts of tunneling mechanisms have been proposed to scale layer-2 broadcast domains and
virtualized networks (IP-based layer-3 networks scale way better by design):
Provider Backbone Bridges (PBB 802.1ah), Shortest Path Bridging-MAC (SPBM 802.1aq) and
vCDNI use MAC-in-MAC tunneling the destination MAC address used to forward user traffic
across the network core is the egress bridge or the destination physical server (for vCDNI).
Figure 5-3: SPBM forwarding diagram from the Data Center 3.0 for Networking Engineers webinar
VXLAN, NVGRE and GRE (used by Open vSwitch) use MAC-over-IP tunneling, which scales way
better than MAC-over-MAC tunneling because the core switches can do another layer of state
abstraction (subnet-based forwarding and IP prefix aggregation).
TRILL is closer to VXLAN/NVGRE than to SPB/vCDNI as it uses full L3 tunneling between TRILL
endpoints with L3 forwarding inside RBridges and L2 forwarding between RBridges.
Figure 5-5: TRILL forwarding diagram from the Data Center 3.0 for Networking Engineers webinar
Figure 5-6: MPLS-over-Ethernet frame format from the Enterprise MPLS/VPN Deployment webinar
THE PROBLEM
Contrary to what some pundits claim, flow-based forwarding will never scale. If you've been around
long enough to experience the ATM-to-the-desktop failure, the Multi-Layer Switching (MLS) kludges,
the demise of end-to-end X.25, or the cost of traditional circuit-switched telephony, you know what
I'm talking about. If not, supposedly it's best to learn from your own mistakes: be my guest.
Before someone starts the Moore's Law incantations: software-based forwarding will always be more
expensive than predefined hardware-based forwarding. Yes, you can push tens of gigabits through a
highly optimized multi-core Intel server. You can also push 1.2 Tbps through a Broadcom chipset at
a fraction of the cost.
SCALABLE ARCHITECTURES
The scalability challenges of flow-based forwarding were well understood (at least within the IETF;
the ITU is living on a different planet) decades ago. That's why we have destination-only forwarding,
variable-length subnet masks and summarization, and DiffServ (with a limited number of traffic
classes) instead of IntServ (with per-flow QoS).
The limitations of destination-only hop-by-hop forwarding were also well understood for at least two
decades and resulted in the MPLS architecture and various MPLS-based applications (including MPLS
Traffic Engineering).
There's a huge difference between the MPLS TE forwarding mechanism (which is the right tool for the
job) and the distributed MPLS TE control plane (which sucks big time). Traffic engineering is ultimately
an NP-complete knapsack problem best solved with centralized end-to-end visibility.
MPLS architecture solves the forwarding rigidity problems while maintaining core network scalability
by recognizing that while each flow might be special, numerous flows share the same forwarding
behavior.
Edge MPLS routers (edge LSRs) thus sort the incoming packets into forwarding equivalence classes
(FECs), and use a different Label Switched Path (LSP) across the network for each of the forwarding
classes.
The simplest classification implemented in all MPLS-capable devices today is destination prefix-based
classification (equivalent to traditional IP forwarding), but there's nothing in the MPLS architecture that
would prevent you from using N-tuples to classify the traffic based on source addresses, port
numbers, or any other packet attribute (yet again, ignoring the reality of having to use PBR with the
infinitely disgusting route-map CLI to achieve that).
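A minimal Python sketch of the edge-classification idea described above; the prefixes, labels and the N-tuple policy are made up for illustration, and the prefix lookup is deliberately simplistic (no longest-prefix match):

    import ipaddress

    # Edge LSR behavior in a nutshell: classify a packet into a FEC, then push the
    # label that identifies the LSP for that FEC.

    # Destination-prefix classification (what every MPLS device does today):
    prefix_fec = {
        ipaddress.ip_network("10.1.0.0/16"): 100,   # label / LSP toward PE-A
        ipaddress.ip_network("10.2.0.0/16"): 200,   # label / LSP toward PE-B
    }

    def classify_by_prefix(dst_ip):
        dst = ipaddress.ip_address(dst_ip)
        for net, label in prefix_fec.items():
            if dst in net:
                return label
        return None

    # Nothing in the architecture prevents N-tuple classification at the edge:
    def classify_by_tuple(pkt):
        if pkt["proto"] == "tcp" and pkt["dport"] == 443:
            return 300                              # HTTPS gets its own LSP
        return classify_by_prefix(pkt["dst"])

    print(classify_by_tuple({"proto": "tcp", "dport": 443, "dst": "10.1.1.1"}))  # -> 300
    print(classify_by_tuple({"proto": "udp", "dport": 53, "dst": "10.2.9.9"}))   # -> 200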
Also, after more than a decade of tinkering, vendor MPLS implementations leave a lot to be
desired. If you hate a particular vendor's CLI or implementation kludges, blame them, not the
technology.
It's hard to build resilient networks with a centralized control plane and unreliable transport
between the controller and the controlled devices (this problem was well known in the days of Frame
Relay and ATM);
You must introduce layers of abstraction in order to scale the network.
Martin Casado, Teemu Koponen, Scott Shenker and Amin Tootoonchian addressed the second
challenge in their Fabric: A Retrospective on Evolving SDN paper, where they propose two layers in
an SDN architectural framework:
Edge switches, which classify the packets, perform network services, and send the packets
across core fabric toward the egress edge switch;
Core fabric, which provides end-to-end transport.
Not surprisingly, they're also proposing to use MPLS labels as the fabric forwarding mechanism.
As explained above, MPLS edge routers classify ingress packets into FECs and attach a label
signifying the desired treatment to each packet. The original packet is not changed in any
way; any intermediate node can get the raw packet content if needed.
NAT, on the other hand, always changes the packet content (at least the layer-3 addresses,
sometimes also layer-4 port numbers), or it wouldn't be NAT.
NAT breaks transparent end-to-end connectivity, MPLS doesn't. MPLS is similar to lossless
compression (ZIP), NAT is similar to lossy compression (JPEG). Do I need to say more?
The (somewhat nuanced) issue I would raise is that [...] decoupling [also] allows
evolving the edge and core separately. Today, changing the edge addressing scheme
requires a wholesale upgrade to the core.
The 6PE architecture (IPv6 on the edge, MPLS in the core) is a perfect example of this concept.
In IP-only networks, the core and access routers (aka layer-3 switches) share the same forwarding
mechanism (ignoring the option of having default routing in the access layer); if you want to
introduce a new protocol, you have to upgrade every device in the forwarding path.
On the other hand, you can introduce IPv6, IPX or AppleTalk (not really), or anything else in an
MPLS network, without upgrading the core routers. The core routers continue to provide a single
function: optimal transport based on MPLS paths signaled by the edge routers (either through LDP,
MPLS-TE, MPLS-TP or more creative approaches, including NETCONF-configured static MPLS labels).
The same ideas apply to OpenFlow-configured networks. The edge devices have to be smart and
support a rich set of flow matching and manipulation functionality; the core (fabric) devices have to
match on simple packet tags (VLAN tags, MAC addresses with PBB encapsulation, MPLS tags ...) and
provide fast packet forwarding.
Before you mention (multicast-based) VXLAN in the comments: I fail to see anything software-
defined in a technology that uses flooding to learn dynamic VM-MAC-to-VTEP-IP mappings.
The following blog post was written in February 2012; in summer 2014 I inserted a few comments
to illustrate how we got nowhere in more than two years.
After the GRE tunnels have been created, they appear as regular interfaces within the Open
vSwitch; an OpenFlow controller can use them in flow entries to push user packets across GRE
tunnels to other hypervisor hosts.
We will probably see VXLAN/NVGRE/GRE implementations in data center switches in the next few
months, but I expect most of those implementations to be software-based and thus useless for
anything else but a proof-of-concept (August 2014: no major data center switching vendor supports
OpenFlow over any tunneling technology).
Cisco already has a VXLAN-capable chipset in the M-series linecards; believers in merchant silicon will
have to wait for the next-generation chipsets (August 2014: Broadcom's and Intel's chipsets support
VXLAN, but so far no vendor has shipped a VXLAN termination that would work with OpenFlow).
VLAN stacking was also introduced in OpenFlow 1.1. While it would be a convenient labeling
mechanism (similar to SPBV, but with a different control plane), many data center switches don't
support Q-in-Q (802.1ad). No VLAN stacking today.
The only standard labeling mechanism left to OpenFlow-enabled switches is thus VLAN tagging
(OpenFlow 1.0 supports VLAN tagging, VLAN translation and tag stripping). You could use VLAN tags
to build virtual circuits across the network core (similar to what MPLS labels do) and the source and
destination MAC addresses to carry additional forwarding information.
THE REALITY
I had the virtual circuits discussion with multiple vendors during the OpenFlow symposium and
Networking Tech Field Day and we always came to the same conclusions:
Someone was also kind enough to give me a hint that solved the "secret awesomesauce" riddle: "We
can use any field in the frame header in any way we like."
Looking at the OpenFlow 1.0 specs (assuming no proprietary extensions are used), you can rewrite
source and destination MAC addresses to indicate whatever you wish: you have 96 bits to work
with. Assuming the hardware devices support wildcard matches on MAC addresses (either by
supporting OpenFlow 1.1 or a proprietary extension to OpenFlow 1.0), you could use the 48 bits of
the destination MAC address to indicate the egress node, egress port, and egress MAC address.
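A small Python sketch of what such an encoding could look like. The 16/16/16-bit split, the field names and the use of a host index (rather than a full 48-bit egress MAC) are assumptions made purely for illustration; a real deployment would also have to worry about details such as the locally-administered address bit.

    # Packing "egress node / egress port / host index" into the 48 bits of a
    # destination MAC address (one possible split: 16 + 16 + 16 bits).

    def pack(node_id, port_id, host_idx):
        value = (node_id << 32) | (port_id << 16) | host_idx
        octets = value.to_bytes(6, "big")
        return ":".join(f"{b:02x}" for b in octets)

    def unpack(mac):
        value = int.from_bytes(bytes.fromhex(mac.replace(":", "")), "big")
        return value >> 32, (value >> 16) & 0xFFFF, value & 0xFFFF

    tag = pack(node_id=7, port_id=24, host_idx=1301)
    print(tag)            # -> 00:07:00:18:05:15
    print(unpack(tag))    # -> (7, 24, 1301)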
SUMMARY
Before buying an OpenFlow-based data center network, figure out what the vendors are doing (they
will probably ask you to sign an NDA, which is fine), including:
What are the mechanisms used to reduce forwarding state in the OpenFlow-based network core?
What's the actual packet format used in the network core (or: how are the fields in the packet
header really used)?
Will you be able to use standard network analysis tools to troubleshoot the network?
Which version of OpenFlow are they using?
Which proprietary extensions are they using (or not using)?
Which switch/controller combinations are tested and fully supported?
You can talk about tunneling when a protocol that should be lower in the protocol stack gets
encapsulated in a protocol that you'd usually find above or next to it. MAC-in-IP, IPv6-in-IPv4, IP-
over-GRE-over-IP, MAC-over-VPLS-over-MPLS-over-GRE-over-IPsec-over-IP ... these are tunnels.
IP-over-MPLS-over-PPP/Ethernet is not tunneling, just like IP-over-LLC1-over-TokenRing or IP-over-
X.25-over-LAPD wasn't.
It is true, however, that MPLS uses virtual circuits, but they are not identical to tunnels. Just
because all packets between two endpoints follow the same path and the switches in the middle
don't inspect their IP headers doesn't mean you're using a tunneling technology.
One-label MPLS is (almost) functionally equivalent to two well-known virtual circuit technologies:
ATM and Frame Relay (that was also its first use case). However, MPLS-based networks scale better
than those using ATM or Frame Relay because of two major improvements:
Automatic setup of virtual circuits based on network topology (core IP routing information), both
between the core switches and between the core (P-routers) and edge (PE-routers) devices. Unless
you're using MPLS TE, there's no need to configure individual virtual circuits.
VC merge: Virtual circuits from multiple ingress points to the same egress point can merge within
the network. VC merge significantly reduces the overall number of VCs (and the amount of state the
core switches have to keep) in fully meshed networks.
It's interesting to note that the ITU wants to cripple MPLS to the point of being equivalent to
ATM/Frame Relay. MPLS-TP introduces an out-of-band management network and management-
plane-based virtual circuit establishment.
DOES IT MATTER?
It might seem like I'm splitting hairs just for the fun of it, but there's a significant scalability
difference between virtual circuits and tunnels: devices using tunnels appear as hosts to the
underlying network and require no in-network state, while solutions using virtual circuits (including
MPLS) require per-VC state entries (MPLS: inbound-to-outbound label mapping in the LFIB) on every
forwarding device in the path. Even worse, end-to-end virtual circuits (like MPLS TE) require state
maintenance (provided by periodic RSVP signaling in MPLS TE) involving every single switch in the
VC path.
You can find scalability differences even within the MPLS world: MPLS/VPN-over-mGRE (tunneling)
scales better than pure label-based MPLS/VPN (virtual circuits) because MPLS/VPN-over-mGRE relies
on IP transport and not on end-to-end LSPs between PE-routers. You can summarize loopback
addresses if you use MPLS/VPN-over-mGRE; doing the same in end-to-end-LSP-based MPLS/VPN
networks breaks them. L2TPv3 scales better than AToM for the same reason.
The only global networks using on-demand virtual circuits were the telephone system and X.25; one
of them already died because of its high per-bit costs, and the other one is surviving primarily
because we're replacing virtual circuits (TDM voice calls) with tunnels (VoIP).
TANGENTIAL AFTERTHOUGHTS
Don't be sloppy with your terminology. There's a reason we use different terms to indicate different
behavior: it helps us understand the implications (example: scalability) of the technology. For example,
it's important to understand why bridging differs from routing and why it's wrong to call them both
switching, and it helps if you understand that Fibre Channel actually uses routing (hidden deep
inside switching terminology).
So far, SDN is relying on or stressing mainly L2-L3 network programmability (switches
and routers). Why are most people not mentioning L4-L7 network services such as
firewalls or ADCs? Why would those elements not have to be SDNed, with OpenFlow
support for instance?
To understand the focus on L2/L3 switching, let's go back a year and a half to the laws-of-physics-
changing big bang event.
The main proponents of OpenFlow/SDN (in the Open Networking Foundation sense) are still the
Googles of the world, and what they want is the ability to run their own control plane on top of
commodity switching hardware. They don't care that much about L4-7 appliances, or people who'd
What is your take on the performance issue with software-based equipment when dealing
with general purpose CPU only? Do you see this challenge as a hard stop to SDN
business?
The short answer (as always) is: it depends. However, I think most people approach this issue the wrong
way.
First, let's agree that SDN means programmable networks (or more precisely, network elements
that can be configured through a reasonable and documented API), not the Open Networking
Foundation's self-serving definition.
Second, I hope we agree it makes no sense to perpetuate the existing spaghetti mess we have in
most data centers. It's time to decouple content and services from the transport, decouple virtual
networks from the physical transport, and start building networks that provide equidistant endpoints
(in which case it doesn't matter to which port a load balancer or firewall is connected).
Implementing fast (and dumb) packet forwarding on L2 (bridge) or L3 (router) on generic x86
hardware makes no sense. It makes perfect sense to implement the control plane on generic x86
hardware (almost all switch vendors use this approach) and a generic OS platform, but it definitely
doesn't make sense to let the x86 CPU get involved with packet forwarding. Broadcom's chipset
can do a way better job for less money.
L4-7 services are usually complex enough to require lots of CPU power anyway. Firewalls
configured to perform deep packet inspection and load balancers inspecting HTTP sessions must
process the first few packets of every session in the CPU anyway, and only then potentially
offload the flow record to dedicated hardware. With optimized networking stacks, it's possible to
get reasonable forwarding performance on well-designed x86 platforms, so there's little reason
to use dedicated hardware in L4-7 appliances today (SSL offload is still a grey area).
On top of everything else, the shortsighted design of the dedicated hardware used by L4-7 appliances
severely limits your options. Just ask a major vendor that needed years to roll out IPv6-enabled load
balancers and high-performance IPv6-enabled firewall blades ... and still doesn't have hardware-based
deep packet inspection of IPv6 traffic.
Before clicking Read more, watch this video and try to figure out what the solution is and why we're
not using it in large-scale networks.
The proposal is truly simple: it uses anycast with per-flow forwarding. All servers have the same IP
address, and the OpenFlow controller establishes a path from each client to one of the servers. In its
most simplistic implementation, a flow entry is installed in all devices in the path every time a client
establishes a session with a server (you could easily improve it by using MPLS LSPs or any other
virtual circuit/tunneling mechanism in the core).
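A back-of-the-envelope calculation in Python that shows why this proposal explodes; every number below (clients, session rate, path length, idle timeout) is invented for illustration, but the arithmetic is the point.

    # Per-session flow entries installed by the naive "anycast + OpenFlow"
    # load balancer described above. All numbers are made up for illustration.

    clients = 10_000
    sessions_per_client_per_sec = 10       # modest browser-like behavior
    path_length = 5                        # switches touched on each client-server path
    flow_idle_timeout = 30                 # seconds before an idle entry ages out

    new_entries_per_sec = clients * sessions_per_client_per_sec * path_length
    resident_entries = new_entries_per_sec * flow_idle_timeout

    print(new_entries_per_sec)   # 500,000 flow setups per second across the fabric
    print(resident_entries)      # 15,000,000 concurrent entries to keep in hardware

Compare those numbers with the 600-1000 flow setups per switch per second mentioned earlier in this chapter and the conclusion is obvious.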
Now ask yourself: will this ever scale? Of course it won't. It might be a good solution for long-lived
sessions (after all, that's how voice networks handle 800-numbers), but not for the data world
where a single client could establish tens of TCP sessions per second.
Load balancers work as well as they do because a single device in the whole path (the load balancer)
keeps the per-session state, and because you can scale them out if they become overloaded: you
just add another pair of redundant devices with new IP addresses to the load balancing pool (and
use DNS-based load balancing on top of them).
Some researchers have quickly figured out the scaling problem and there's work being done to make
OpenFlow-based load balancing scale better, but one has to wonder: after they're done and their
solution scales, will it be any better than what we have today, or will it just be different?
Moral of the story: every time you hear about an incredible solution to a well-known problem, ask
yourself: why weren't we using it in the past? Were we really that stupid, or are there some inherent
limitations that are not immediately visible? Will it scale? Is it resilient? Will it survive device or link
failures? And don't forget: history is a great teacher.
More complex challenges (example: traffic engineering) have been solved using the traditional
architecture of distributed, loosely coupled, independent nodes (example: MPLS TE), but could benefit
from centralized network visibility.
Finally, the traditional solutions haven't even tried to tackle some of the harder networking problems
(example: megaflow-based forwarding or centralized policies with on-demand deployment) that
could be solved with a controller-based architecture.
MORE INFORMATION
You'll find additional SDN- and OpenFlow-related information on the ipSpace.net web site.
This chapter contains several real-life SDN solutions, most of them OpenFlow-based. For alternate
approaches, see the SDN Beyond OpenFlow chapter; for even more use cases, watch the publicly
available videos from my OpenFlow-based SDN Use Cases webinar.
IN THIS CHAPTER:
OPENFLOW: ENTERPRISE USE CASES
OPENFLOW @ GOOGLE: BRILLIANT, BUT NOT REVOLUTIONARY
COULD IXPS USE OPENFLOW TO SCALE?
IPV6 FIRST-HOP SECURITY: IDEAL OPENFLOW USE CASE
OPENFLOW: A PERFECT TOOL TO BUILD SMB DATA CENTER
SCALING DOS MITIGATION WITH OPENFLOW
NEC+IBM: ENTERPRISE OPENFLOW YOU CAN ACTUALLY TOUCH
BANDWIDTH-ON-DEMAND: IS OPENFLOW THE SILVER BULLET?
OPENSTACK/QUANTUM SDN-BASED VIRTUAL NETWORKS WITH FLOODLIGHT
NICIRA, BIGSWITCH, NEC, OPENFLOW AND SDN
Leaving aside the pretentious claims about how OpenFlow will solve hard problems like global load
balancing, there are four functions you can easily implement with OpenFlow (Tony Bourke wrote
about them in more detail):
Combine that with the ephemeral nature of OpenFlow (whatever a controller downloads into the
networking device does not affect the running/startup configuration and disappears when it's no longer
needed), and you get a handy tool for implementing temporary changes on top of the existing device
configuration.
Actually, I don't care if the mechanism used to change a networking device's forwarding tables is OpenFlow
or something completely different, as long as it's programmable, multi-vendor and integrated with
the existing networking technologies. As I wrote a number of times, OpenFlow is just a
TCAM/FIB/packet classifier download tool.
Remember one of OpenFlow's primary use cases: adding functionality where the vendor is lacking it (see
Igor Gashinsky's presentation from the OpenFlow Symposium for a good coverage of that topic).
Now stop for a minute and remember how many times you badly needed some functionality along
the lines of the four functions I mentioned above (packet filters, PBR, static routes, NAT) that you
couldn't implement at all, or that required a hodgepodge of expect scripts (or XML/NETCONF requests
if you're a Junos automation fan) that you have to modify every time you deploy a different device
type or a different software release.
Here are a few ideas I got in the first 30 seconds (if you get other ideas, please do write a
comment):
This is a work of fiction, based solely on the publicly available information presented by
Google's engineers at the Open Networking Summit (plus an interview or two published by the
industry press). Read and use it at your own risk.
On top of that, every G-router has a (proprietary, I would assume) northbound API that is used by
Google's Traffic Engineering (G-TE): a centralized application that analyzes the application
requirements, computes the optimal paths across the network, and creates those paths through the
network of G-routers using the above-mentioned API.
I wouldn't be surprised if G-TE used MPLS forwarding instead of installing 5-tuples into mid-
path switches. Doing Forwarding Equivalence Class (FEC) classification at the head-end device
instead of at every hop is way simpler and less loop-prone.
Like MPLS-TE, G-TE runs in parallel with the traditional routing protocols. If it fails (or an end-to-end
path is broken), G-routers can always fall back to traditional BGP+IGP-based forwarding, and like
with MPLS-TE+IGP, you'll still have a loop-free (although potentially suboptimal) forwarding
topology.
IS IT SO DIFFERENT?
Not really. Similar concepts (central path computation) were used in ATM and Frame Relay
networks, as well as in early MPLS-TE implementations (before Cisco implemented OSPF/IS-IS traffic
engineering extensions, RSVP was all you had).
Some networks are supposedly still running offline TE computations and static MPLS TE tunnels
because they give you way better results than the distributed MPLS-TE/autobandwidth/automesh
kludges.
You could do the same thing (should you wish to do it) with traditional gear using NETCONF with
a bit of MPLS-TP sprinkled on top (or your own API if you have switches that can be easily
programmed in a decent programming language; Arista immediately comes to mind), but it would
be a slight nightmare and would still suffer the drawbacks of distributed signaling protocols (even
static MPLS-TE tunnels use RSVP these days).
The true difference between their implementation and everything else on the market is thus that
they did it the right way, learning from all the failures and mistakes we made in the last two
decades.
Even though they had to make a hefty investment in the G-router platform, they claim their network
already converges almost 10x faster than before (on the other hand, it's not hard to converge faster
than BGP).
HYPE GALORE
Based on the information from the Open Networking Summit (which is all the information I have at the
moment), you might wonder what all the hype is about. In one word: OpenFlow. Let's try to debunk
those claims a bit.
Google is running an OpenFlow network. Get lost. Google is using OpenFlow between controller and
adjacent chassis switches because (like everyone else) they need a protocol between the control
plane and forwarding planes, and they decided to use an already-documented one instead of
inventing their own (the extra OpenFlow hype could also persuade hardware vendors and chipset
manufacturers to implement more OpenFlow capabilities in their next-generation products).
Google built their own routers ... and so can you. Really? Based on the scarce information from ONS
talks and interview in Wired, Google probably threw more money and resources at the problem than
a typical successful startup. They effectively decided to become a router manufacturer, and they did.
Can you repeat their feat? Maybe, if you have comparable resources.
Google used open-source software ... so the monopolistic Ciscos of the world are doomed. Just in
case you believe that fairy-tale conclusion, let me point out that many Internet exchanges use open-
source software for BGP route servers, and almost all networking appliances and most switches built
today run on open-source software (namely Linux or FreeBSD). It's the added value that matters; in
Google's case, their traffic engineering solution.
CONCLUSIONS
Google's engineers did a great job: it seems they built a modern routing platform that everyone
would love to have, and an awesome traffic engineering application. Does it matter to you and me?
Probably not; I don't expect them to give their crown jewels away. Does it matter that they used
OpenFlow? Not really; it's a small piece of their whole puzzle. Will someone else repeat their feat
and bring a low-cost high-end router to the market? I doubt it, but I hope to be wrong.
Internet Exchange Points (IXPs) seemed a perfect fit: they are high-speed mission-critical
environments usually implemented as geographically stretched layer-2 networks, and they face all sorts
of security and scaling problems. Deploying OpenFlow on IXP edge switches would result in a
standardized security posture that wouldn't rely on the idiosyncrasies of a particular vendor's
implementation, and we could use OpenFlow to implement an ARP sponge (or turn ARPs into unicasts
sent to an ARP server).
I presented these ideas at MENOG 12 in March 2013 and got a few somewhat interested responses
... and then I asked a really good friend with significant operational experience in IXP environments
for feedback. Not surprisingly, the reply was a cold shower:
I am not quite sure how this improves the current situation. Except for the ARP sponge,
everything else seems to be implemented by vendors in one form or another. For the ARP
sponge, AMS-IX uses great software developed in house that they've open-sourced.
As always, from the ops perspective proven technologies beat shiny new tools.
A note from a grumpy skeptic: his deployment works great because he's carrying a pretty
limited number of BGP routes; the Pica8 switches he's using support up to 12K routes.
IPv4 or IPv6? Who knows; the data sheet ignores that nasty detail.
SHORT SUMMARY
Many layer-2 switches still lack IPv6 feature parity with IPv4;
IPv6 uses three address allocation algorithms (SLAAC, privacy extensions, DHCPv6) and it's quite
hard to enforce a specific one;
Host implementations are wildly different (aka: The nice thing about standards is that you have
so many to choose from.).
IPv6 address tracking is a hodgepodge of kludges.
Whenever a new end-host appears on the network, it's authenticated, and its MAC address is
logged. Only that MAC address can be used on that port (many switches already implement this
functionality).
Whenever an end-host starts using a new IPv6 source address, the packets are not matched by
any existing OpenFlow entries and thus get forwarded to the OpenFlow controller.
The OpenFlow controller decides whether the new source IPv6 address is legal (enforcing DHCPv6-only
address allocation if needed), logs the new IPv6-to-MAC address mapping, and modifies the flow
entries in the first-hop switch. The IPv6 end-host can use many IPv6 addresses; each one of
them is logged immediately.
Ideally, if the first-hop switches support all the nuances introduced in OpenFlow 1.2, the
controller can install neighbor advertisement (NA) filters, effectively blocking ND spoofing (a minimal
sketch of the controller logic follows this list).
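The following controller-agnostic Python sketch captures the decision logic described in the list above; the function and table names, the policy knobs and the return values are all invented for illustration, and a real controller would of course also install the resulting flow entries through its OpenFlow channel.

    # Controller-side decision logic for the IPv6 first-hop security idea above.
    allowed_bindings = {}          # (switch, port) -> set of authenticated MACs
    ip_to_mac_log = []             # audit trail of IPv6-to-MAC mappings
    dhcpv6_only = True             # optional policy: enforce DHCPv6-assigned addresses

    def on_packet_in(switch, port, src_mac, src_ip, learned_via_dhcpv6):
        if src_mac not in allowed_bindings.get((switch, port), set()):
            return "drop"                          # MAC not authenticated on this port
        if dhcpv6_only and not learned_via_dhcpv6:
            return "drop"                          # SLAAC/privacy addresses not allowed
        ip_to_mac_log.append((src_ip, src_mac, switch, port))
        return "install-flow"                      # permit this IPv6/MAC pair on this port

    allowed_bindings[("sw1", 3)] = {"00:1c:42:00:00:01"}
    print(on_packet_in("sw1", 3, "00:1c:42:00:00:01", "2001:db8::10", True))   # install-flow
    print(on_packet_in("sw1", 3, "00:1c:42:00:00:99", "2001:db8::bad", True))  # drop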
Will this nirvana appear anytime soon? Not likely. Most switch vendors support only OpenFlow 1.0,
which is totally IPv6-ignorant. Also, solving real-life operational issues is never as sexy as promoting
the next unicorn-powered fountain of youth.
As of August 2014, NEC is still the only vendor with a commercial-grade data center fabric product
using OpenFlow. Most other vendors use more traditional architectures, and the virtualization world
is quickly moving toward overlay virtual networks.
Anyhow, this is how I envisioned potential OpenFlow use in a small data center in 2012:
Once the networking vendors figure out the fine details, they could use dedicated management
ports for out-of-band OpenFlow control plane (similar to what QFabric is doing today), DHCP to
assign an IP address to the switch, and a new DHCP option to tell the switch where the controller is.
The DHCP server would obviously run on the OpenFlow controller, and the whole control plane
infrastructure would be completely isolated from the outside world, making it pretty secure.
The extra hardware cost for significantly reduced complexity (no per-switch configuration and a
single management/SNMP IP address): two dumb 1GE switches (to make the setup redundant),
hopefully running MLAG (to get rid of STP).
Finally, assuming server virtualization is the most common use case in an SMB data center, you could
tightly couple the OpenFlow controller with VMware's vCenter, and let vCenter configure the whole
network:
You could easily build a GE Clos fabric using switches from NEC America: PF5240 (ToR switch) as
leaf nodes (you'd have almost no oversubscription with 48 GE ports and 4 x 10GE uplinks), and
PF5820 (10 GE switch) as spine nodes and interconnection points with the rest of the network.
Using just two PF5820 spine switches you could get over 1200 1GE server ports, enough to connect
200 to 300 servers (probably hosting anywhere between 5,000 and 10,000 VMs).
You'd want to keep the number of switches controlled by the OpenFlow controller low to avoid
scalability issues. NEC claims they can control up to 200 ToR switches with a controller cluster; I
would be slightly more conservative.
If you want true converged storage with DCB, you have to use IBM's switches (NEC does not
have DCB), and even then I'm not sure how DCB would work with OpenFlow.
Anyhow, assuming all the bumps eventually do get ironed out, you could have a very easy-to-
manage network connecting a few hundred 10GE-attached servers.
A few companies have all the components one would need in an SMB data center (Dell, HP, IBM), and
Dell just might be able to pull it off (while HP is telling everyone how they'll forever change the
networking industry). And now that I've mentioned Dell: how about configuring your data center
through a user-friendly web interface and having it shipped to your location in a few weeks?
A curious mind obviously wants to know what's behind the scenes. Masterpieces of engineering?
Large integration projects ... or is it just a smart application of API glue? In most cases, it's the
latter. Let's look at the ProgrammableFlow-Radware integration.
Here's a slide from NEC's white paper. An interesting high-level view, but no details. Radware's press
release is even less helpful (but it's definitely a masterpiece of marketing).
DefenseFlow software monitors the flow entries and counters provided by an OpenFlow
controller, and tries to identify abnormal traffic patterns;
The abnormal traffic is diverted to Radware DefensePro appliance that scrubs the traffic before
its returned to the data center.
Both operations are easily done with the ProgrammableFlow API: it provides both flow data and the
ability to redirect the traffic to a third-party next hop (or MAC address) based on a dynamically
configured access list. There's a CLI example in the ProgrammableFlow webinar; the API call would be
very similar (but formatted as a JSON or XML object).
Doing initial triage and subsequent traffic blackholing in cheaper forwarding hardware (programmed
through OpenFlow) and diverting a small portion of the traffic through the scrubbing appliance
significantly improves the average bandwidth a DPI solution can handle at reasonable cost.
However, an OpenFlow controller does provide a more abstract API: instead of configuring PBR
entries that push traffic toward a next hop (or an MPLS TE tunnel if you're an MPLS ninja) and
modifying router configurations while doing so, you just tell the OpenFlow controller that you want
the traffic redirected toward a specific MAC address, and the necessary forwarding entries
automagically appear all across the path.
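Purely as an illustration of the "tell the controller what you want" idea, here is a hypothetical REST call sketched in Python. The URL, resource name, payload fields and authentication are all invented; they are not NEC's (or anyone else's) actual API, so consult the real controller documentation before writing anything like this.

    import json
    import requests

    # Hypothetical REST call asking a controller to divert matching traffic toward
    # a scrubbing appliance's MAC address. Endpoint and payload are invented.
    controller = "https://pfc.example.net:8443"
    redirect_rule = {
        "match": {"ipv4_dst": "192.0.2.0/24"},                 # victim prefix under attack
        "action": {"redirect_to_mac": "00:50:56:ab:cd:ef"},    # scrubbing appliance
        "priority": 500,
    }

    resp = requests.post(f"{controller}/api/flows",
                         data=json.dumps(redirect_rule),
                         headers={"Content-Type": "application/json"},
                         verify=False)                          # lab-only: skip TLS checks
    resp.raise_for_status()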
Finally, there's the sexiness factor. Mentioning SDN instead of NetFlow or PBR in your press release
is infinitely more attractive to bedazzled buyers.
However, a well-tuned solution using the right combination of hardware and software (example:
NEC's PF5240, which can handle 160,000 L2, IPv4 or IPv6 flows in hardware) just might work. Still,
we're early in the development cycle, so make sure you do thorough (stress) testing before buying
anything ... and just in case you need a rock-solid traffic generator, Spirent will be more than happy
to sell you one (or a few).
A BIT OF A BACKGROUND
Tervela's data fabric solutions typically run on top of traditional networking infrastructure, and an
underperforming network (particularly long outages triggered by suboptimal STP implementations)
can severely impact the behavior of the services running on their platform.
They were looking for a solution that would perform way better than what their customers are
typically using today (large layer-2 networks), while at the same time being easy to design, deploy,
and operate.
EASY TO DEPLOY?
As long as your network is not too big (NEC claimed their controller can manage up to 50 switches in
their Networking Tech Field Day presentation, and later releases of ProgrammableFlow increased
that limit to 200), the design and deployment isn't too hard according to Tervela's engineers:
They decided to use an out-of-band management network and connected the management port of
the BNT8264 to the management network (they could also use any other switch port).
All you have to configure on the individual switch is the management VLAN, a management IP
address and the IP addresses of the OpenFlow controllers.
The ProgrammableFlow controller automatically discovers the network topology using LLDP
packets sent from the controller through individual switch interfaces.
After those basic steps, you can start configuring virtual networks in the OpenFlow controller
(see the demo NEC made during the Networking Tech Field Day).
Obviously, you'd want to follow some basic design rules, for example:
Make the management network fully redundant (read the QFabric documentation to see how
that's done properly);
Connect the switches into a structure somewhat resembling a Clos fabric, not in a ring or a
random mess of cables.
They found out that (as expected) the first packet exchanged between a pair of VMs experiences an
8-9 millisecond latency because it's forwarded through the OpenFlow controller, with subsequent
packets having a latency they were not able to measure (their tool has a 1 msec resolution).
Lesson #1: If the initial packet latency matters, use proactive programming mode (if available) to
pre-populate the forwarding tables in the switches;
Lesson #2: Don't do full 12-tuple lookups unless absolutely necessary. You'd want to experience
the latency only when the inter-VM communication starts, not for every TCP/UDP flow (not to
mention that capturing every flow in a data center environment is a sure recipe for disaster).
If it takes 8-9 milliseconds for the controller to program a single flow into the switches (see
the latency figures above), it's totally impossible that the same controller would do a massive
reprogramming of the forwarding tables in less than a millisecond. The failure response
must have been preprogrammed in the forwarding tables.
NEC added OAM functionality in later releases of ProgrammableFlow, probably solving this
problem.
Finally, assuming their test bed allowed the ProgrammableFlow controller to prepopulate the backup
entries, it would be interesting to observe the behavior of a four-node square network, where it's
impossible to find a loop-free alternate path unless you use virtual circuits like MPLS Fast Reroute
does.
They were able to reserve a link for high-priority traffic and observe automatic load balancing across
alternate paths (which would be impossible in a STP-based layer-2 network), but they were not able
to configure statistics-based routing (route important flows across underutilized links).
In the last 20 years, at least three technologies have been invented to solve the bandwidth-on-
demand problem: RSVP, ATM Switched Virtual Circuits (SVC) and MPLS Traffic Engineering (MPLS-
TE). None of them was ever widely used to create a ubiquitous bandwidth-on-demand service.
I'm positive very smart network operators (including major CDN and content providers like Google)
use MPLS-TE very creatively. I'm also sure there are environments where RSVP is mission-critical
functionality. I'm just saying bandwidth-on-demand is like IP multicast: it's used by the 1% of
networks that badly need it.
All three technologies I mentioned above faced the same set of problems:
You don't think the last bullet is real? Then tell me how many off-the-shelf applications have RSVP
support ... even though RSVP has been available in Windows and Unix/Linux servers for ages. How
many applications can mark their packets properly? How many of them allow you to configure the DSCP
value to use (apart from IP phones)?
Similarly, it's not hard to implement bandwidth-on-demand for specific elephant flows (inter-DC
backup, for example) with a pretty simple combination of MPLS-TE and PBR, potentially configured
with NETCONF (assuming you have a platform with a decent API). You could even do it with SNMP:
pre-instantiate the tunnels and PBR rules, and enable the tunnel interface by changing ifAdminStatus.
When have you last seen that done?
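As a minimal sketch of that last idea (assuming the classic pysnmp 4.x high-level API; the router address, community string and ifIndex below are placeholders), bringing up a pre-provisioned tunnel interface is a single SNMP SET of IF-MIB::ifAdminStatus:

    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, Integer, setCmd)

    # Enable a pre-provisioned tunnel interface by setting IF-MIB::ifAdminStatus
    # to up(1). Router address, community string and ifIndex are placeholders.
    TUNNEL_IFINDEX = 42

    error_indication, error_status, _, var_binds = next(setCmd(
        SnmpEngine(),
        CommunityData("private"),
        UdpTransportTarget(("192.0.2.1", 161)),
        ContextData(),
        ObjectType(ObjectIdentity("IF-MIB", "ifAdminStatus", TUNNEL_IFINDEX), Integer(1)),
    ))

    if error_indication or error_status:
        raise RuntimeError(error_indication or error_status.prettyPrint())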
So, although I'm the first one to admit OpenFlow is an elegant tool to integrate flow classification
(previously done with PBR) with traffic engineering (using MPLS-TE or any of the novel technologies
proposed by Juniper) using the hybrid deployment model, being a seasoned skeptic, I just don't
believe we'll reach the holy grail of bandwidth-on-demand during this hype cycle. However, being an
eternal optimist, I sincerely hope I'm wrong.
Now, what does that have to do with OpenFlow, SDN, Floodlight and Quantum?
Every old idea will be proposed again with a different name and a different presentation,
regardless of whether it works.
OpenStack virtual networks are created with the REST API of the Quantum (networking)
component of OpenStack;
Quantum uses back-end plug-ins to create the virtual networks in the actual underlying network
fabric. Quantum (and the rest of OpenStack) does not care how the virtual networks are
implemented as long as they provide isolated L2 domains.
Big Switch decided to implement virtual networks with dynamic OpenFlow-based L2 ACLs instead
of using VLAN tags.
The REST API exposed by Floodlight's VirtualNetworkFilter module offers simple methods that
create virtual networks and assign MAC addresses to them.
The VirtualNetworkFilter intercepts new flow setup requests (PacketIn messages to the Floodlight
controller), checks that the source and destination MAC addresses belong to the same virtual
network, and permits or drops the packet.
If the VirtualNetworkFilter accepts the flow, Floodlight's Forwarding module installs the flow
entries for the newly created flow throughout the network (a conceptual sketch of the membership
check follows this list).
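Floodlight itself is Java code; the following Python snippet is only a conceptual rendition of what the membership check described above amounts to, with made-up virtual network names and MAC addresses. It also illustrates the in-memory-tables concern raised below.

    # Conceptual (Python) rendition of the VirtualNetworkFilter check; the real
    # module is Java code inside Floodlight, and its tables live only in memory.
    virtual_networks = {
        "tenant-red":  {"00:00:00:00:00:01", "00:00:00:00:00:02"},
        "tenant-blue": {"00:00:00:00:00:03"},
    }
    mac_to_vnet = {mac: name for name, macs in virtual_networks.items() for mac in macs}

    def on_packet_in(src_mac, dst_mac):
        src_net = mac_to_vnet.get(src_mac)
        dst_net = mac_to_vnet.get(dst_mac)
        if src_net is not None and src_net == dst_net:
            return "install-flow-entries"          # Forwarding module takes over
        return "drop"

    print(on_packet_in("00:00:00:00:00:01", "00:00:00:00:00:02"))  # same vnet -> forward
    print(on_packet_in("00:00:00:00:00:01", "00:00:00:00:00:03"))  # different -> drop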
The current release of Floodlight installs per-flow entries throughout the network. I'm not
particularly impressed with the scalability of this approach (and I'm not the only one).
The Floodlight controller is a single point of failure (there's no provision for a redundant
controller);
Unless I can't read Java code (which wouldn't surprise me at all), the VirtualNetworkFilter stores
all mappings (including MAC membership information) in in-memory structures that are lost if
the controller or the server on which it runs crashes;
As mentioned above, the per-flow entries used by the Floodlight controller don't scale at all (more about
that in an upcoming post).
The whole thing is thus a nice proof-of-concept tool that will require significant efforts (probably
including a major rewrite of the forwarding module) before it becomes production-ready.
However, we should not use Floodlight to judge the quality of the yet-to-be-released commercial
OpenFlow controller from Big Switch Networks. This is how Mike Cohen explained the differences:
I want to highlight that all of the points you raised around production deployability and
flow scalability (and some you didn't around how isolation is managed / enforced) are
indeed addressed in significant ways in our commercial products. There's a separation
between what's in Floodlight and the code folks will eventually see from Big Switch.
As always, I might become a believer once I see the product and its documentation.
In the meantime, VMware bought Nicira (as I predicted in the last paragraph), and Nicira's NVP
became the basis for VMware's NSX.
Target environment: Large cloud builders and other organizations leaning toward Xen/OpenStack.
NEC and BigSwitch are building virtual networks by rearranging the forwarding tables in the physical
switches. Their OpenFlow controllers are actively reconfiguring the physical network, creating virtual
networks out of VLANs, interfaces, or sets of MAC/IP addresses.
Deployment paradigm: we know hypervisor switches are stupid and can't see beyond VLANs, so
we'll make the network smarter (aka VM-aware networking).
Target environment: large enterprise networks and those that build cloud solutions with existing
software using VLAN-based virtual switches.
Between Nicira and Cisco/Juniper switches: few. Large cloud providers have already gotten rid of enterprise
kludges and use simple L2 or L3 fabrics. The Facebooks, Googles and Amazons of the world run on IP;
they don't care much about TRILL-like inventions. Some of them buy equipment from Juniper, Cisco,
Force10 or Arista, some of them build their own boxes, but however they build their network, that
Between Nicira and Cisco's Nexus 1000V: not at the moment. Open vSwitch runs on Xen/KVM,
Nexus 1000V runs on VMware/Hyper-V. Open vSwitch runs on vSphere, but with way lower
throughput than Nexus 1000V. Obviously Cisco could easily turn the Nexus 1000V VSM into an
OpenFlow controller (I predicted that would be their first move into the OpenFlow world, and was proven
dead wrong) and manage Open vSwitches, but there's nothing at the moment to indicate they're
considering it.
Between BigSwitch/NEC and Cisco/Juniper. This one will be fun to watch, more so with IBM, Brocade
and HP clearly joining the OpenFlow camp and Juniper cautiously being on the sidelines.
However, Nicira might trigger an interesting mindset shift in the cloud aspirant community: all of a
sudden, Xen/OpenStack/Quantum makes more sense from the scalability perspective. A certain
virtualization vendor will indubitably notice that ... unless they already focused their true efforts on
PaaS (at which point all of the above becomes a moot point).
Many end-users (including Microsoft, a founding member of ONF) and vendors took a different
approach, and created solutions that use traditional networking protocols in a different way, rely on
overlays to reduce the complexity through decoupling, or use a hierarchy of control planes to
achieve better resilience.
This chapter starts with a blog post describing the alternate approaches to SDN and documents
several potentially usable protocols and solutions.
MORE INFORMATION
You'll find additional SDN- and OpenFlow-related information on the ipSpace.net web site.
IN THIS CHAPTER:
THE FOUR PATHS TO SDN
THE MANY PROTOCOLS OF SDN
EXCEPTION ROUTING WITH BGP: SDN DONE RIGHT
NETCONF = EXPECT ON STEROIDS
DEAR $VENDOR, NETCONF != SDN
WE NEED BOTH OPENFLOW AND NETCONF
CISCO ONE: MORE THAN JUST OPENFLOW/SDN
THE PLEXXI CHALLENGE (OR: DONT BLAME THE TOOLS)
I2RS: JUST WHAT THE SDN GOLDILOCKS IS LOOKING FOR?
This definition is too narrow for most use cases, resulting in numerous solutions and architectures
being branded as SDN. Most of these solutions fall into one of the four categories described in the
blog post I wrote in August 2014.
As always, each approach has its benefits and drawbacks, and there's no universally best solution.
You just got four more (somewhat immature) tools in your toolbox. And now for the details.
That definition, while serving the goals of the ONF founding members, is at the moment mostly
irrelevant for most enterprise or service provider organizations, which cannot decide to become a
router manufacturer to build a few dozen WAN edge routers and based on the amount of
FYI, I'm not blaming OpenFlow. OpenFlow is just a low-level tool that can be extremely
handy when you're trying to implement unusual ideas.
I am positive there will be people building OpenFlow controllers controlling forwarding fabrics, but
they'll eventually realize what a monumental task they've undertaken when they have to reinvent all
the wheels the networking industry invented in the last 30 years, including:
Topology discovery;
Fast failure detection (including detection of bad links, not just lost links);
Fast reroute around failures;
Path-based forwarding and prefix-independent convergence;
Scalable linecard protocols (LACP, LLDP, STP, BFD ...).
The decoupling approach works well assuming there are no leaky abstractions (in other words, the
overlay can ignore the transport network, which wasn't exactly the case in Frame Relay or ATM
networks). Overlay virtual networks work well over fabrics with equidistant endpoints, and fail as
miserably as any other technology when being misused for long-distance VLAN extensions.
VENDOR-SPECIFIC APIS
After the initial magical dust of SDN-washing settled down, few vendors remained standing (I'm
skipping those that allow you to send configuration commands in an XML envelope and call that
programmability):
Arista has eAPI (access to the EOS command line through REST) as well as the capability to install
any Linux component on their switches, and use programmatic access to EOS data structures
(sysdb);
Cisco's onePK gives you extensive access to the inner workings of Cisco IOS and IOS XE (I haven't
found anything NX-OS-related on DevNet);
Juniper has some SDK that's safely tucked behind a partner-only regwall. Just the right thing to
do in 2014.
Not surprisingly, vendors would love you to use their API. After all, that's the ultimate lock-in they can get.
Finally, don't forget that we've been using remote-triggered black holes for years (the RFC
describing it is five years old, but the technology itself is way older); we just didn't know we were
doing SDN back in those days.
If you're planning to implement novel ideas in the data center, overlay virtual networks might be the
way to go (more so as you can change the edge functionality without touching the physical
networking infrastructure).
Do you need flexible dynamic ACLs or PBR? Use OpenFlow (or even better, DirectFlow if you have
Arista switches).
Looking for a large-scale solution that controls the traffic in LAN or WAN fabric? BGP might be the
way to go.
Finally, you can do things you cannot do with anything else with some vendor APIs (but do
remember the price you're paying).
Remote-Triggered Black Holes is one of the oldest solutions using BGP as the mechanism to modify
a network's forwarding behavior from a central controller.
Some network virtualization vendors use BGP to build MPLS/VPN-like overlay virtual networking
solutions.
I2RS and PCEP (a protocol used to create MPLS-TE tunnels from a central controller) operate on the
control plane, in parallel with traditional routing protocols. BGP-LS exports link-state topology and MPLS-
TE data through BGP.
OVSDB is a protocol that treats control-plane data structures as database tables and enables a
controller to query and modify those structures. It's used extensively in VMware's NSX, but could be
used to modify any data structure (assuming one defines an additional schema that describes the
data).
OpenFlow, MPLS-TP, ForCES and Flowspec (PBR through BGP used by creative network operators
like CloudFlare) work on the data plane and can modify the forwarding behavior of a controlled
device. OpenFlow is the only one of them that defines data-to-control-plane interactions (with the
Packet In and Packet Out OpenFlow messages).
Interestingly, you don't need new technologies to get as close to that holy grail as you wish; Petr
Lapukhov got there with a 20-year-old technology: BGP.
THE PROBLEM
I'll use a well-known suboptimal network to illustrate the problem: a ring of four nodes (it could be
anything, from a monkey-designed fabric to a stack of switches) with heavy traffic between nodes A
and D.
In a shortest-path forwarding environment you cannot spread the traffic between A and D across all
links (although you might get close with a large bag of tricks).
Can we do any better with controller-based forwarding? We definitely should. Let's see how we can
tweak BGP to serve our SDN purposes.
Obviously I'm handwaving over lots of moving parts: you need topology discovery, reliable next
hops, and a few other things. If you really want to know all those details, listen to the Packet
Pushers podcast where we deep dive into them (hint: you could also engage me to help you build
it).
To shift the traffic around in the four-node example, the controller could advertise the following
routes (a minimal ExaBGP-style sketch follows this list):
Two identical BGP paths (with next hops B and D) to A (to ensure the BGP route selection
process in A uses BGP multipathing);
A BGP path with next hop C to B (B might otherwise send some of the traffic for D to A, resulting
in a forwarding loop between B and A).
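A minimal sketch, assuming an ExaBGP-style controller process whose route injections are plain text commands written to stdout. The router loopbacks, next-hop addresses and the destination prefix are invented; they merely mirror the four-node example above.

    import sys

    # Controller process feeding an ExaBGP-style text API: routes are injected by
    # writing "announce route" commands on stdout. All addresses are placeholders.
    ROUTERS = {"A": "192.0.2.1", "B": "192.0.2.2"}
    D_PREFIX = "10.4.0.0/16"                       # prefix behind node D

    def announce(neighbor, prefix, next_hop):
        sys.stdout.write(f"neighbor {neighbor} announce route {prefix} "
                         f"next-hop {next_hop}\n")
        sys.stdout.flush()

    # Two equal-cost paths toward D for node A, so A load-shares across both links ...
    announce(ROUTERS["A"], D_PREFIX, next_hop="192.0.2.20")   # via B
    announce(ROUTERS["A"], D_PREFIX, next_hop="192.0.2.40")   # via D directly
    # ... and a path via C for node B, so B never hands D-bound traffic back to A.
    announce(ROUTERS["B"], D_PREFIX, next_hop="192.0.2.30")   # via C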
MORE INFORMATION
Routing Design for Large-Scale Data Centers (Petr's presentation @ NANOG 55)
Use of BGP for Routing in Large-Scale Data Centers (IETF draft)
Centralized Routing Control in BGP Networks (IETF draft)
WHAT IS NETCONF?
NETCONF (RFC 6241) is an XML-based protocol used to manage the configuration of networking
equipment. It allows the management console (manager) to issue commands and change the
configuration of networking devices (NETCONF agents). In this respect, it's somewhat similar to
SNMP, but since it uses XML, it provides a much richer set of functionality than the simple key/value
pairs of SNMP.
For more details, I would strongly suggest you listen to the NETCONF Packet Pushers podcast.
It's thus possible to write a network management application using a standard MIB that would work
with equipment from all vendors that decided to implement that MIB. For example, should the
Hadoop developers decide to use LLDP to auto-discover the topology of the Hadoop clusters, they
could rely on LLDP MIB being available in switches from most data center networking vendors.
Apart from a few basic aspects of session management, no such standardized data structure exists in
the NETCONF world. For example, there's no standardized command (specified in an RFC) that you
could use to get the list of interfaces, shut down an interface, or configure an IP address on an
interface. The drafts are being written by the NETMOD working group, but it will take a while before
they make it to RFC status and get implemented by major vendors.
Every single vendor that graced us with a NETCONF implementation thus uses its own proprietary
format within NETCONF's XML envelope. In most cases, the vendor-specific part of the message
maps directly into existing CLI commands (in the case of Junos, the commands are XML-formatted
because Junos uses XML internally). Could I thus write a NETCONF application that would work with
Cisco IOS and Junos? Sure I could, if I implemented a vendor-specific module for every device
family I plan to support in my application.
Using a standard protocol that provides clear message delineation (expect scripts were mainly
guesswork and could break with every software upgrade done on the networking devices) and error
reporting (another guesswork part of the expect scripts) is evidently a much more robust solution,
but it's still too little, delivered way too slowly. What we need is a standard mechanism for
configuring a multi-vendor environment, not a better wrapper around the existing CLI (although the
better wrapper does come in handy).
There might be a yet-to-be-discovered vendor out there that creatively uses NETCONF to change the
device behavior in ways that cannot be achieved by CLI or GUI configuration, but most of them use
NETCONF as a reliable Expect script.
More precisely: what I've seen being done with NETCONF or XMPP is executing CLI commands or
changing device (router, switch) configuration on the fly using a mechanism that is slightly more
reliable than a Perl script doing the same thing over an SSH session. Functionally it's the same thing
as typing the exec-level or configuration commands manually (only a bit faster and with no auto-
correct).
What's missing? A few examples: you cannot change the device behavior beyond the parameters
already programmed into its operating system (like you could with iRules on an F5 BIG-IP). You cannot
implement new functionality (apart from trivial things like configuring and removing static routes or
packet/route filters). And yet some $vendors I respect call that SDN. Give me a break; I know you
can do better than that.
However, lame joking aside, the definition of SDN as promoted by the Open Networking Foundation
requires the separation of control and data planes, and you simply can't do that with NETCONF. If
anything, ForCES would be the right tool for the job, but you haven't heard much about ForCES from
the SDN crowd, have you?
There might be interesting things you could do through network device configuration with NETCONF
(installing route maps with policy-based routing, access lists, or static MPLS in/out label mappings,
for example), but installing the same entries via OpenFlow would be way easier, simpler and (most
importantly) device- and vendor-independent.
For example, NETCONF has no standard mechanism you can use today to create and apply an ACL
to an interface. You can create an ACL on a Cisco IOS/XR/NX-OS or a Junos switch or router with
NETCONF, but the actual contents of the NETCONF message would be vendor-specific. To support
devices made by multiple vendors, you'd have to implement vendor-specific functionality in your
NETCONF controller. In contrast, you could install the same forwarding entries (with the DROP
action) through OpenFlow into any OpenFlow-enabled switch (the only question being whether
these entries would be executed in hardware or by the central CPU).
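A minimal ncclient sketch of the point being made: the NETCONF session and the edit-config operation are standard, but the payload is not. The XML fragment below is a placeholder, not any vendor's actual schema, and the device address, credentials and use of the candidate datastore are assumptions.

    from ncclient import manager

    # The NETCONF transport and <edit-config> operation are standard; the payload
    # inside <config> is whatever the target platform expects, so a multi-vendor
    # controller needs one template per device family.
    VENDOR_SPECIFIC_ACL = """
    <config>
      <!-- placeholder elements, not a real vendor schema -->
      <acl-config xmlns="urn:example:placeholder">
        <name>BLOCK-TELNET</name>
        <deny><protocol>tcp</protocol><port>23</port></deny>
      </acl-config>
    </config>
    """

    with manager.connect(host="192.0.2.1", port=830, username="admin",
                         password="secret", hostkey_verify=False) as conn:
        # Assumes a device with a candidate datastore; use target="running" otherwise.
        conn.edit_config(target="candidate", config=VENDOR_SPECIFIC_ACL)
        conn.commit()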
OpenFlow-created entries in the forwarding table are by definition temporary. They don't appear in
the device configuration (and are probably fun to troubleshoot because they only appear in the
forwarding table) and are lost on device reload or link loss.
One of the most important messages in Cisco's ONE launch is that OpenFlow is just a small part of
the big picture. That's pretty obvious to anyone who has tried to understand what OpenFlow is all about,
and we've heard it before, but realistic statements like this tend to get lost in all the hype
generated by OpenFlow zealots and the industry press.
The second, even more important message is: let's not reinvent the wheel. Google might have the
needs and resources to write their own OpenFlow controllers, northbound API, and custom
applications on top of that API; the rest of us would just like to get our job done with minimum
hassle. To help us get there, Cisco plans to add the One Platform Kit (onePK) API to IOS, IOS-XR and
NX-OS.
OpenFlow has the same problem: it's useless without a controller with a northbound API, and
there's no standard northbound API at the moment. If I want to modify packet filters on my wireless
access point, or create a new traffic engineering tunnel, I have to start from scratch.
That's where onePK comes in: it gives you high-level APIs that allow you to inspect or modify the
behavior of the production-grade software you already have in your network. You don't have to deal
with the low-level details.
OPEN OR PROPRIETARY?
No doubt the OpenFlow camp will be quick to claim onePK is proprietary. Of course it is, but so is
almost every other SDK or API in this industry. If you decide to develop an iOS application, you
cannot run it on Windows 7; if your orchestration software works with VMware's API, you cannot use
it to manage Hyper-V.
The real difference between networking and most of the other parts of the IT is that in networking
you have a choice. You can use onePK, in which case your application will only work with Cisco IOS
and its cousins, or you could write your own application stack (or use a third party one) using
OpenFlow to communicate with the networking gear. The choice is yours.
MORE DETAILS
You can get more details about Cisco ONE on Cisco's web site and its data center blog, and a
number of bloggers published really good reviews:
In a recent blog post Marten Terpstra hinted at the shortcomings of the Shortest Path First (SPF) approach
used by every single modern routing algorithm. Let's take a closer look at why Plexxi's engineers
couldn't use SPF.
Four lambdas (40 Gbps) are used to connect to the adjacent (east and west) switch;
Two lambdas (20 Gbps) are used to connect to four additional switches in both directions.
The beauty of the Plexxi ring is the ease of horizontal expansion: assuming you got the wiring right, all
you need to do to add a new ToR switch to the fabric is disconnect a cable between two switches
and insert the new switch between them, as shown in the next diagram. You could do it in a live
network if the network survives a short-term drop in fabric bandwidth while the CWDM ring is
reconfigured.
Central controllers (well known from SONET/SDH, Frame Relay and ATM days);
Distributed traffic engineering (thoroughly hated by anyone who had to operate a large MPLS TE
network close to its maximum capacity).
Plexxi decided to use a central controller, not to provision the virtual circuits (like we did in ATM
days) but to program the UCMP (Unequal Cost Multipath) forwarding entries in their switches.
Does that mean that we should forget all we know about routing algorithms and SPF-based ECMP
and rush into controller-based fabrics? Of course not. SPF and ECMP are just tools. They have well-
known characteristics and well understood use cases (for example, they work great in leaf-and-spine
fabrics). In other words, dont blame the hammer if you decided to buy screws instead of nails.
Interface to the Routing System (I2RS) is a new initiative that should provide just what we might
need in those cases. To learn more about I2RS, you might want to read the problem statement and
framework drafts, view the slides presented at IETF 84, or even join the irs-discuss mailing list.
Even if you don't want to know those details, but consider yourself a person interested in routing
and routing protocols, do read two excellent e-mails written by Russ White: in the first one he
explained how I2RS might appear as yet another routing protocol and benefit from the existing
routing-table-related infrastructure (including admin distance and route redistribution); in the
second one he described several interesting use cases.
Is I2RS the SDN porridge we're looking for? It's way too early to tell (we need to see more than an
initial attempt to define the problem and the framework), but the idea is definitely promising.