[Figure residue: workflow diagram. Step 5: Model validation (cross validation, error analysis). Meet requirements? Yes / No. Step 6: Deployment and inference (tradeoff on speed, memory, stability and accuracy of inference).]
Networking application | Steps of MLN workflow
Objectives | Specific works | Problem formulation | Data collection: offline collection | Data collection: online measurement | Data analysis | Offline model construction | Deployment and online inference
Information cognition | Sibyl [11]: route measurement | SL: prediction with RuleFit | Combine data of platforms with a few powerful VPs in homogeneous deployment and with many limited VPs around the world | Take users’ queries as input round by round | / | Construct RuleFit model to assign confidence to each predicted path | Optimize measurement budget in each round to get the best query coverage
Resource management | DeepRM [13]: job scheduling | RL: decision making with deep RL | Synthetic workload with different patterns is used for training | The real-time resource demand of the arrival job | Action space is too large and actions may conflict with each other | Offline training to update the policy network | Directly schedule the arrival jobs with the trained model
Network adaption | Ref [2]: routing strategy | SL: decision making with Deep Belief Architectures (DBA) | Traffic patterns labeled with routing paths computed by the OSPF protocol | Online traffic patterns in each router | It is difficult to characterize the input and output patterns to reflect the dynamic nature of large-scale heterogeneous networks | Use layer-wise training to initialize and the backpropagation process to fine-tune the DBA structure | Record and collect the traffic patterns in each router periodically and obtain the next routing nodes from the DBAs
Network adaption | Pytheas [7]: general QoE optimization | RL: decision making with a variant of UCB algorithm | Session quality information with features in large time scale | Session quality information in small time scale | Application sessions sharing the same features can be grouped | Backend cluster determines the session groups using CFA [5] with a long time scale | Frontend performs the group-based exploration-exploitation strategy in real time
Network adaption | Remy [3]: TCP congestion control | RL: decision making with a tabular method | Collect experience from network simulator | Calculate network state variables with ACK | Select the most influential metrics as state variables | Given the network assumptions, the generated algorithm interacts with the simulator to learn the best actions according to states | Directly implement the Remy-generated algorithm in the corresponding network environment
Network adaption | PCC [4]: TCP congestion control | RL: decision making with online learning | / | Calculate the utility function according to the received SACK | TCP assumptions are often violated. The direct performance is a better signal | / | Take trials with different sending rates and find the best rate according to the feedback utility function
Performance prediction | CFA [5]: video QoE optimization | USL: clustering with self-designed algorithm | Datasets consisting of quality measurements are collected from public CDNs | Take session features as input, such as Bitrate, CDN, Player, etc. | Similar sessions have similar quality determined by critical features | Critical feature learning in minutes scale and quality estimation in tens of seconds | Look up feature-quality table to respond to real-time query
Performance prediction | CS2P [1]: throughput prediction | SL: prediction with HMM | Datasets of HTTP throughput measurement from iQIYI | Take users’ session features as input | Sessions with similar features tend to behave in related patterns | Find the set of critical features and learn an HMM for each cluster of similar sessions | A new session is mapped to the most similar session cluster and the corresponding HMM is used to predict throughput
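The Pytheas row above formulates QoE optimization as an exploration-exploitation problem solved with a variant of UCB. As a minimal sketch of the underlying idea only, the following uses textbook UCB1, not Pytheas's actual variant. The three "arms" (imagined here as candidate CDNs), their fixed noise-free quality values, and all function names are hypothetical choices made for illustration.

```python
import math


def ucb1_select(counts, means, t):
    """Return the index of the arm with the highest UCB1 score at round t."""
    # Play every arm once before trusting the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Score = empirical mean + exploration bonus that shrinks as an arm is played.
    scores = [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_ucb1(reward_fn, n_arms, rounds):
    """Run UCB1 for a number of rounds; return per-arm play counts and mean rewards."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = reward_fn(arm)
        counts[arm] += 1
        # Incremental update of the running mean reward of the chosen arm.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


# Hypothetical setup: three choices with fixed quality scores in [0, 1].
QUALITY = [0.2, 0.8, 0.5]
counts, means = run_ucb1(lambda arm: QUALITY[arm], n_arms=3, rounds=500)
best = max(range(3), key=counts.__getitem__)
print("plays per arm:", counts, "most played arm:", best)
```

In this toy setting the arm with the highest quality ends up being played most often, while the other arms are still sampled occasionally. That occasional sampling is the exploration that a purely prediction-based scheme lacks.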
treated as a pioneer paradigm to import machine learning into networking fields.
Resource Management and Network Adaption
Efficient resource management and network adaption are the keys to improving network system performance. Some example issues to address are traffic scheduling, routing [2], and TCP congestion control [3, 4]. All these issues can be formulated as a decision-making problem [13]. However, it is challenging to solve these problems with a rule-based heuristic algorithm due to the complexity of diverse system environments, noisy inputs and the difficulty of optimizing the tail performance [13]. Specifically, arbitrary parameter assignments based on experience and actions taken following predetermined rules often result in a scheduling algorithm that is understandable to people but far from optimal.
Deep learning is a promising solution due to its ability to characterize the inherent relationships between the inputs and outputs of network systems without human involvement. In order to meet the requirements of changing network environments, previous efforts in [2, 14] design a traffic control system with the support of deep learning techniques. Reconsidering backbone router architectures and strategies, the system takes the traffic pattern in each router as input and outputs the next nodes in the routing path with Deep Belief Architectures. These advancements unleash the potential of the DL-based strategy in network routing and scheduling. Harnessing the powerful representational ability of deep neural networks, deep reinforcement learning has achieved strong results in many AI problems.
DeepRM [13] is the first work that applies a deep RL algorithm to cluster resource scheduling. Its performance is comparable to state-of-the-art heuristic algorithms but at lower cost. The QoE optimization problem can also benefit from the RL methodology. Unlike previous efforts, Pytheas [7] regards this problem as an exploration-exploitation-based problem rather than a prediction-based problem. As a result, Pytheas outperforms state-of-the-art prediction-based systems by lessening the prediction bias and delayed response. From this perspective, machine learning may help close the loop of “sensing-analysis-decision,” especially in wireless sensor networks, where the three actions are separated from each other at present.
Several attempts have been made to optimize the TCP congestion control algorithm using the reinforcement learning approach, due to the difficulty of designing a congestion control algorithm that can fit all network states. To make the algorithm self-adaptive, Remy [3] takes the target network assumptions and traffic model as prior knowledge to automatically generate a specific algorithm, which achieves a large performance gain in many circumstances. In the offline phase, Remy tries to learn a mapping, i.e., RemyCC, between the network state and the corresponding parameters of the congestion window (cwnd) by interacting with the network simulator. In the online phase, whenever an ACK is received, RemyCC looks up its mapping table and changes its cwnd behavior according to the current network state. The mechanism of Remy is illustrated in Fig. 2. PCC [4], a performance-oriented attempt that makes no specific network assumptions, benefits instead from its online-learning nature. Although these TCP-related efforts still focus on decision making, they take the first important step toward automated protocol design.
Network Performance Prediction and Configuration Extrapolation
Performance prediction can guide decision making. Some example applications are video QoE prediction, CDN location selection, best wireless channel selection, and performance extrapolation under different configurations. Machine learning is a natural approach to predicting system states for better decision making.
Typically, there are two general prediction scenarios. First, the system owner has the ability to obtain sufficient and varied historical data, but it is non-trivial to build a complex prediction model and update it in real time, which requires a new approach exploiting domain-specific knowledge to simplify the problem (e.g., CFA [5] for video QoE optimization). CS2P [1] aims to improve video bitrate selection with accurate prediction. From data analysis, it finds that sessions with similar key features tend to have more closely related throughput behavior. CS2P learns to cluster similar sessions offline and trains a separate Hidden Markov Model for each cluster to predict the corresponding throughput given the current session information. CS2P reinforces the correlation of similar sessions in the training process,
[Table 2 header: Objectives | Specific works | Offline time cost | Online time cost | Device information]
which outperforms approaches that use a single model. This is very similar to the above-mentioned traffic prediction problem, since they both passively fit the runtime ground truth with a certain metric. In the second prediction scenario, little historical data exists and it is infeasible to obtain representative data by conducting performance tests due to high trial costs in real network systems. To deal with this dilemma, CherryPick [15] leverages the Bayesian Optimization algorithm to minimize pre-run rounds with directional guidance to collect representative runtime data of workloads under different configurations.
Feasibility Discussion
One big challenge faced by ML-based methods is their feasibility. Since many networking applications are delay-sensitive, it is non-trivial to design a real-time system with heavy computation loads. To make it practical, a common solution is to train the model with global information over a long period of time and incrementally update the model with local information on a small time scale [5, 7], which trades off computation overhead against information staleness. In the online phase, the common case is to look up the result table or draw the inference with a trained model to make real-time decisions. The processing times of the above works are selectively listed in Table 2, which shows that ML has practical value when the system is well designed. In addition, the robustness and generalization of a design are also important for feasibility and are discussed later.
From these perspectives, ML in its current state is not suitable for all networking problems. The network problems solved with ML techniques so far are more or less related to prediction, classification and decision making, while it is difficult to apply machine learning to other types of problems. Other reasons that prevent the application of ML techniques include the lack of labeled data, high system dynamics and the high cost brought by learning errors.
Opportunities for MLN
The prior efforts mostly focus on the generalized concepts of prediction and classification, and few go beyond this scope to explore other possible applications. However, with the latest breakthroughs in machine learning and its infrastructures, new potential demands may appear in network disciplines. Some opportunities are introduced as follows.
Open Datasets for the Networking Community
Collecting a large amount of high-quality data that contains both network profiles and performance metrics is one of the most critical issues for MLN. However, acquiring enough labeled data is still expensive and labor intensive even in today’s machine learning community. For many reasons, it is not easy for researchers to acquire enough real trace data even if there are many existing open datasets in the networking domain.
This reality drives us to learn from the machine learning community and put much more effort into constructing open datasets like ImageNet. With unified open datasets, performance benchmarks are an inevitable outcome, providing a standard platform for researchers to compare their new algorithms or architectures with state-of-the-art ones. This can reduce unrepresentative repeated experiments and have a positive effect on academic integrity. In addition, it has been shown in the machine learning domain that learning with a simulator rather than in a real environment is more effective and less costly in RL scenarios [3]. In the networking domain, due