


General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

Users may download and print one copy of any publication from the public portal for the purpose of private study or research.

You may not further distribute the material or use it for any profit-making activity or commercial gain.

You may freely distribute the URL identifying the publication in the public portal. If you believe that this document breaches copyright, please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from orbit.dtu.dk on: Jun 30, 2020

Machine Learning Methods in Coherent Optical Communication Systems

Jones, Rasmus Thomas

Publication date: 2019

Document Version: Publisher's PDF, also known as Version of record

Link back to DTU Orbit

Citation (APA): Jones, R. T. (2019). Machine Learning Methods in Coherent Optical Communication Systems. Technical University of Denmark.


Ph.D. Thesis
Doctor of Philosophy

Machine Learning Methods in Coherent Optical Communication Systems

Rasmus Thomas Jones

Kongens Lyngby 2019


DTU Fotonik
Department of Photonics Engineering
Technical University of Denmark

Ørsteds Plads
Building 343
2800 Kongens Lyngby, Denmark
Phone +45 4525 6352
www.fotonik.dtu.dk


Abstract

The future demand for digital information will outpace the capabilities of current optical communication systems, which are approaching their limits due to fiber-intrinsic nonlinear effects. Machine learning methods promise to find new ways of exploiting the available resources, and to handle future challenges in larger and more complex systems.

The methods presented in this work apply machine learning to optical communication systems, with three main contributions. First, a machine learning framework combines dimension reduction and supervised learning, addressing computationally expensive steps of fiber channel models. The trained algorithm allows more efficient execution of the models. Second, supervised learning is combined with a mathematical technique, the nonlinear Fourier transform, realizing more accurate detection of high-order solitons. While the technique aims to overcome the intrinsic fiber limitations using a reciprocal effect between chromatic dispersion and nonlinearities, there is a non-deterministic impairment from fiber loss and amplification noise, leading to distorted solitons. A machine learning algorithm is trained to detect the solitons despite the distortions, and in simulation studies the trained algorithm outperforms the standard receiver. Third, an unsupervised learning algorithm with an embedded fiber channel model is trained end-to-end, learning a geometric constellation shape that mitigates nonlinear effects. In simulation and experimental studies, the learned constellations yield improved performance over state-of-the-art geometrically shaped constellations.

The contributions presented in this work show how machine learning can be applied together with optical fiber channel models, and demonstrate that machine learning is a viable tool for increasing the capabilities of optical communication systems.


Resumé

The future demand for digital information will outpace the capacity of current optical communication systems, which are approaching their limits due to nonlinear fiber effects. Machine learning methods let us find new ways of exploiting the existing resources and handling challenges in larger and more complex systems.

The methods presented in this thesis are machine learning methods for optical communication systems, divided into three main categories. First, a machine learning framework that, by combining dimension reduction and supervised learning, addresses computationally expensive steps in fiber modeling. The trained algorithm gives access to more efficient computations in the models. In the second method, supervised learning is combined with a mathematical technique, the nonlinear Fourier transform, which makes it possible to detect higher-order solitons. The technique seeks to overcome fiber limitations by using a reciprocal effect between chromatic dispersion and nonlinearities. There is a non-deterministic source of error from fiber loss and amplifier noise, which leads to distorted solitons. A machine learning algorithm is trained to detect the solitons despite the distortions, and in simulations the algorithm performed better than a standard receiver. The third and final method uses an unsupervised learning algorithm with an embedded fiber channel model. This algorithm is trained end-to-end to obtain a geometric constellation shape that mitigates nonlinear effects. In both simulations and experiments, this algorithm improved on the performance of state-of-the-art geometric constellations.

The work presented here shows how machine learning can be used together with optical fiber channel models, and demonstrates that machine learning is a useful tool for increasing the capacity of optical communication systems.


Preface

The work presented in this thesis was carried out as a part of my Ph.D. project in the period November 1st, 2015 to October 31st, 2018. The work took place at the Department of Photonics Engineering, Technical University of Denmark, Kgs. Lyngby, Denmark, with an external stay of six months at the Photonic Network System Laboratory, National Institute of Information and Communications Technology, Tokyo, Japan.

This Ph.D. project was financed by Keysight Technologies (Böblingen, Germany) and by the European Research Council through the ERC-CoG FRECOM project (grant agreement no. 771878), and supervised by:

• Darko Zibar (main supervisor), Associate Professor, Department of Photonics Engineering, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark

• Metodi P. Yankov (co-supervisor), Fingerprint Cards A/S, 2730 Herlev, Denmark, and Department of Photonics Engineering, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark

• Molly Piels (co-supervisor), Juniper Networks, 1133 Innovation Way, Sunnyvale, California 94089, USA, formerly Department of Photonics Engineering, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark


Acknowledgements

First, I would like to thank my supervisor at DTU, Darko Zibar, for giving me the opportunity to pursue this Ph.D., and together with my co-supervisors, Metodi P. Yankov and Molly Piels, for the guidance, encouragement and many fruitful discussions over the last three years.

Thanks to the High Speed Optical Communications (HSOC) group, for great laughs around the lunch table and during the coffee breaks; in particular, to Francesco for always being reliable, and to Lisbeth for herding the cats and coordinating us scientists.

Further, I would like to thank Ben Puttnam from the National Institute of Information and Communications Technology for hosting me for my external research stay in Tokyo, and together with Georg, Ruben, Tobias and Werner, for introducing me to the Japanese culture, guiding me through the Japanese cuisine and lending me camping gear. Arigato for the collaborations, and for teaching me everything I needed to know in the lab and the wonders of an espresso machine.

Thanks to the DSP cowboys, Júlio, Simone, Edson, Martin and Stenio, for fruitful discussions, collaborations, flip chart art projects, Friday bar beers, and simply sharing the office with a friendly and daily “Good morning”.

Thank you to the sweetest flat-mates, Tasja and Micha, and lately, Sara, for the time we spent in the kitchen and for providing me with a home to arrive to every evening.

Many thanks to my friends in Copenhagen and back in Germany for constantly reaching out, and for all your diverse input forcing me to keep an open mind. Thanks to Rob, Andi and Carys for visiting me in Japan.

Thank you to my family for their unconditional support throughout the years. In particular, my parents, Suzan and Simon, and my brother, Mark: thank you for your love and for visiting me in Japan. Thank you, Èva, my partner in crime.


It is more fun to talk with someone who doesn't use long, difficult words but rather short, easy words like, “What about lunch?”

- Winnie the Pooh


Ph.D. Publications

The following publications have resulted from this Ph.D. project.

Articles in peer-reviewed journals

[J1] J. C. M. Diniz, F. Da Ros, E. P. da Silva, R. T. Jones, and D. Zibar, “Optimization of DP-M-QAM Transmitter Using Cooperative Coevolutionary Genetic Algorithm,” Journal of Lightwave Technology, vol. 36, no. 12, pp. 2450–2462, 2018.

[J2] R. T. Jones, S. Gaiarin, M. P. Yankov, and D. Zibar, “Time-Domain Neural Network Receiver for Nonlinear Frequency Division Multiplexed Systems,” IEEE Photonics Technology Letters, vol. 30, no. 12, pp. 1079–1082, 2018.

[J3] J. Thrane, J. Wass, M. Piels, J. C. Diniz, R. Jones, and D. Zibar, “Machine learning techniques for optical performance monitoring from directly detected PDM-QAM signals,” Journal of Lightwave Technology, vol. 35, no. 4, pp. 868–875, 2017.

Articles submitted to peer-reviewed journals

[S1] R. T. Jones, T. A. Eriksson, M. P. Yankov, B. J. Puttnam, G. Rademacher, R. S. Luís, and D. Zibar, “Geometric Constellation Shaping for Fiber Optic Communication Systems via End-to-end Learning,” arXiv preprint arXiv:1810.00774, 2018.


Contributions to international peer-reviewed conferences

[C1] R. T. Jones, T. A. Eriksson, M. P. Yankov, and D. Zibar, “Deep Learning of Geometric Constellation Shaping including Fiber Nonlinearities,” in European Conf. Optical Communication (ECOC), IEEE, 2018, We1F.5.

[C2] R. T. Jones, S. Gaiarin, M. P. Yankov, and D. Zibar, “Noise Robust Receiver for Eigenvalue Communication Systems,” in Optical Fiber Communication Conference, IEEE, 2018, pp. 1–3.

[C3] B. J. Puttnam, G. Rademacher, R. S. Luís, R. T. Jones, Y. Awaji, and N. Wada, “Impact of Temperature on Propagation Time and Inter-Core Skew in a 19-Core Fiber,” in Asia Communications and Photonics Conference, Optical Society of America, 2018.

[C4] S. Savian, J. C. M. Diniz, A. P. T. Lau, F. N. Khan, S. Gaiarin, R. Jones, and D. Zibar, “Joint Estimation of IQ Phase and Gain Imbalances Using Convolutional Neural Networks on Eye Diagrams,” in CLEO: Science and Innovations, Optical Society of America, 2018, STh1C–3.

[C5] J. C. M. Diniz, F. Da Ros, R. T. Jones, and D. Zibar, “Time skew estimator for dual-polarization QAM transmitters,” in European Conf. Optical Communication (ECOC), IEEE, 2017, pp. 1–3.

[C6] R. T. Jones, J. C. Diniz, M. Yankov, M. Piels, A. Doberstein, and D. Zibar, “Prediction of second-order moments of inter-channel interference with principal component analysis and neural networks,” in European Conf. Optical Communication (ECOC), IEEE, vol. 1, 2017.

[C7] J. Wass, J. Thrane, M. Piels, R. Jones, and D. Zibar, “Gaussian process regression for WDM system performance prediction,” in Optical Fiber Communication Conference, Optical Society of America, 2017, Tu3D–7.

[C8] S. Gaiarin, X. Pang, O. Ozolins, R. T. Jones, E. P. Da Silva, R. Schatz, U. Westergren, S. Popov, G. Jacobsen, and D. Zibar, “High speed PAM-8 optical interconnects with digital equalization based on neural network,” in Asia Communications and Photonics Conference, Optical Society of America, 2016, AS1C–1.


[C9] J. Wass, J. Thrane, M. Piels, J. C. Diniz, R. Jones, and D. Zibar, “Machine Learning for Optical Performance Monitoring from Directly Detected PDM-QAM Signals,” in European Conf. Optical Communication (ECOC), VDE, 2016, pp. 1–3.


Contents

Abstract
Resumé
Preface
Acknowledgements
Ph.D. Publications
Acronyms

1 Introduction
  1.1 Motivation and Outline of Contributions
  1.2 Structure of the Thesis

2 Coherent communication systems
  2.1 The Transmitter
    2.1.1 Lasers
    2.1.2 Modulation Formats
    2.1.3 Electrical Signal Generation
    2.1.4 The Optical Modulator
  2.2 The Additive White Gaussian Noise Channel
  2.3 The Optical Fiber Channel
    2.3.1 Split-step Fourier Method
    2.3.2 Simple XPM Model
    2.3.3 Gaussian Noise Model
    2.3.4 Nonlinear Interference Noise Model
    2.3.5 Comparison
  2.4 The Receiver
    2.4.1 Coherent Detection
    2.4.2 Digital Signal Processing
  2.5 Performance Metrics
    2.5.1 Signal-to-Noise Ratio
    2.5.2 Optical Signal-to-Noise Ratio
    2.5.3 Pre-FEC Bit Error Ratio
    2.5.4 Mutual Information
  2.6 Nonlinear Fourier Transform Based Communication Systems
  2.7 Constellation Shaping

3 Machine Learning Methods
  3.1 Learning Algorithms
    3.1.1 Supervised Learning
    3.1.2 Unsupervised Learning
    3.1.3 Reinforcement Learning
  3.2 Neural Network
    3.2.1 Structure
    3.2.2 Training
    3.2.3 Test and Validation
    3.2.4 Activation Functions
    3.2.5 Regularization
  3.3 Regression
  3.4 Classification
  3.5 Dimension Reduction and Feature Extraction
    3.5.1 Principal Component Analysis
    3.5.2 Autoencoder
  3.6 Deep Learning
    3.6.1 Rectified Linear Unit
    3.6.2 Momentum
    3.6.3 Automatic differentiation

4 Prediction of Fiber Channel Properties
  4.1 Introduction
  4.2 Machine Learning Framework
  4.3 Training and Prediction
  4.4 Results and Discussion
  4.5 Summary

5 Neural Network Receiver for NFDM Systems
  5.1 Introduction
  5.2 Simulation Setup
  5.3 Receiver Schemes
    5.3.1 NFT Receiver
    5.3.2 Minimum Distance Receiver
    5.3.3 Neural Network Receiver
  5.4 Results and Discussion
    5.4.1 Receiver Comparison over the AWGN Channel
    5.4.2 Temporal and Spectral Evolution
    5.4.3 Receiver Comparison over the Fiber Optic Channel
    5.4.4 Neural Network Receiver Hyperparameters
  5.5 Summary

6 Geometric Constellation Shaping via End-to-end Learning
  6.1 Introduction
  6.2 End-to-end Learning
    6.2.1 Autoencoder Model Architecture
    6.2.2 Autoencoder Model Training
    6.2.3 Mutual Information Estimation
    6.2.4 Learned Constellations
  6.3 Simulation
    6.3.1 Setup
    6.3.2 Results
  6.4 Experimental Demonstration
    6.4.1 Setup
    6.4.2 Results
  6.5 Joint Optimization of Constellation and System Parameters
  6.6 Discussion
    6.6.1 Mitigation of Nonlinear Effects
    6.6.2 Model Discrepancy
    6.6.3 Digital Signal Processing Drawbacks
  6.7 Summary

7 Conclusion
  7.1 Summary
  7.2 Outlook

A Softmax Activation and Mutual Information Estimation

Bibliography


Acronyms

ADC analog-to-digital converter

AE autoencoder

AI artificial intelligence

AIR achievable information rate

ASE amplified spontaneous emission

AWG arbitrary waveform generator

AWGN additive white Gaussian noise

BER bit error rate

BPSK binary phase shift keying

COI channel of interest

DAC digital-to-analog converter

DBP digital back-propagation

DSP digital signal processing

EDFA erbium-doped fiber amplifier

EGN extended Gaussian noise

FEC forward error correction

FWM four-wave mixing

GMI generalized mutual information

GN Gaussian noise


GPU graphics processing unit

HWHM half width at half maximum

I in-phase

IM/DD intensity modulation/direct detection

IPM iterative polar modulation

LMS least mean squares

MI mutual information

ML maximum likelihood

MSE mean squared error

MZM Mach-Zehnder modulator

NFDM nonlinear frequency-division multiplexing

NFT nonlinear Fourier transform

NLI nonlinear interference

NLIN nonlinear interference noise

NLSE nonlinear Schrödinger equation

OSNR optical signal-to-noise ratio

PCA principal component analysis

PRBS pseudorandom binary sequence

PSD power spectral density

Q quadrature

QAM quadrature amplitude modulation

QPSK quadrature phase-shift keying

ReLU rectified linear unit

SDM space-division multiplexing


SGD stochastic gradient descent

SSMF standard single-mode fiber

SNR signal-to-noise ratio

SPM self-phase modulation

SSFM split-step Fourier method

VOA variable optical attenuators

WDM wavelength-division multiplexing

XPM cross-phase modulation


CHAPTER 1
Introduction

The globalization of our society is largely attributed to progress in communication technology [1]. We depend on services providing digital information and connecting us at all times via email, instant messaging, and audio or video calls. Data has become an economic resource distributed on the Internet by a global infrastructure based on fiber optic communication technology. The modern world is highly dependent on the Internet to reliably move data across countries and continents. In 2021, global Internet traffic is expected to reach 278 exabytes per month [2]. Transmitting data at today's data rates and latencies is only possible with the bandwidth and low loss provided by optical fibers as the transmission medium [3].

Multiple technological breakthroughs have enhanced the available data rates of fiber optic systems in the past decades. The optical fiber was first developed in 1966 [4], leading to the standard single-mode fiber (SSMF), which provides low loss and larger transmission distances than previous guided media [5]. Optical amplifiers, such as the erbium-doped fiber amplifier (EDFA) [6], replaced electronic regeneration schemes and enabled all-optical and cost-effective amplification of multiple wavelength-division multiplexing (WDM) channels [7]. Within the coherent era starting in the late 2000s, the usable EDFA bandwidth reached 5 THz. Coherent transceiver technology [8, 9] could be built onto the existing fiber infrastructure, giving access to both quadratures and polarizations of each optical carrier. In combination with increasing analog-to-digital converter (ADC) speeds, coherent technology led to incredible growth in data throughput during this decade. In 2018, a spectral efficiency within an SSMF of 17.3 bit/s/Hz was reported [10], more than 40 times the spectral efficiency obtained in the late 1990s [3].

Consumer demand for digital information is growing exponentially [2], driven by new applications on an ever increasing number of devices. Bandwidth-heavy multi-user virtual reality entertainment services and the Internet of Things pose challenging demands on the existing infrastructure and available latencies. Further, machine-to-machine communication will outpace the traffic consumed by humans [11]. Within data centers, large-scale machine learning algorithms demand repeated access to vast amounts of data [12, 13], teaching themselves from human-labeled examples. Future networks must scale further while staying cost and power efficient [14]. The growing network complexity requires new digital signal processing (DSP) algorithms, low-power devices, self-monitoring systems, as well as adaptive and flexible routing mechanisms [3].

Increasing the spectral efficiency within an SSMF even further requires high-order modulation formats, which demand larger optical launch power to counteract amplification noise. However, higher power levels trigger the performance-limiting nonlinear response of the SSMF [15], due to the fiber-intrinsic Kerr effect. The Kerr effect describes power-dependent nonlinear interactions of signal frequency components. Although the nonlinear effects are deterministic, they are driven by the random data carried by the WDM signals. In combination with chromatic dispersion and amplification noise, the nonlinear effects make joint processing and signal recovery of multiple WDM channels challenging [16]. Further, joint processing of multiple WDM channels is not possible in optically routed systems. When different channels have distinct destinations, the deterministic nonlinear interactions become stochastic from a receiver standpoint, since the underlying information of the neighbouring channels is unavailable when demodulating and recovering the signal. Hence, a launch power exists for which the amplification noise and nonlinear effects are balanced and which maximizes the transmission rate. This power-dependent limitation is widely referred to as the nonlinear Shannon limit [16–19]. Although the existence of such a limit has never been proven, how to address it has received a lot of attention within fiber optic communication research, with different potential approaches [20].
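The balance described above can be sketched with a toy Gaussian-noise-model style SNR expression, SNR(P) = P / (P_ASE + η·P³), where the cubic term models nonlinear interference. The values of `P_ase` and `eta` below are arbitrary placeholders, not parameters from this thesis; the snippet only illustrates that an optimal launch power exists and matches the closed form P_opt = (P_ASE / 2η)^(1/3).

```python
import numpy as np

# Toy GN-model style SNR: SNR(P) = P / (P_ase + eta * P**3).
# P_ase: accumulated ASE noise power, eta: NLI coefficient.
# Both values are illustrative placeholders in normalized units.
P_ase = 1e-4
eta = 10.0

P = np.logspace(-3, 1, 2001)          # candidate launch powers
snr = P / (P_ase + eta * P**3)        # noise-limited at low P, NLI-limited at high P

P_opt_numeric = P[np.argmax(snr)]
# Setting dSNR/dP = 0 gives P_ase = 2*eta*P^3, i.e.:
P_opt_closed = (P_ase / (2 * eta)) ** (1 / 3)
```

Below the optimum the system is amplification-noise limited; above it, nonlinear interference dominates, which is exactly the trade-off the nonlinear Shannon limit refers to.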

Mitigating and compensating nonlinear effects with DSP techniques has been studied [21], where most of these efforts are based on insight from the nonlinear Schrödinger equation (NLSE) [22] and fiber channel models derived from it [23–30]. By propagating the received lightwave systematically through a virtual fiber from receiver to transmitter, digital back-propagation (DBP) simulates the reverse of the complex nonlinear effects defined by the NLSE. It accounts for the deterministic processes of the nonlinear effects, yet it is limited by non-deterministic interactions between signal and amplification noise, by polarization effects, and by its computational demands. Compensation methods that build upon perturbation models [29, 31], first- or higher-order approximations of the NLSE, range from compensating for the accumulated fiber nonlinearities in one step [31] to continuously tracking a time-varying process induced by fiber nonlinearities with a Kalman filter [32]. Modeling the NLSE with a Volterra series [33] leads to an inverse transfer function for the NLSE, yet, as with DBP, such a compensation method is computationally expensive.
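To make the DBP principle concrete, the sketch below implements a minimal lossless, single-polarization split-step Fourier routine in normalized units, and inverts the propagation by negating the dispersion and nonlinearity and reversing the step order. It is an illustration of the idea only, not the DSP used in this thesis; the pulse shape and all parameter values are arbitrary.

```python
import numpy as np

def split_step(field, beta2, gamma, length, nsteps, dt, backward=False):
    """Split-step Fourier sketch (lossless, single polarization,
    normalized units). backward=True negates both operators and
    reverses their order per step, i.e. digital back-propagation."""
    E = field.astype(complex)
    dz = length / nsteps
    w = 2 * np.pi * np.fft.fftfreq(len(E), d=dt)
    s = -1.0 if backward else 1.0
    D = np.exp(s * 0.5j * beta2 * w**2 * dz)  # linear (dispersion) operator
    for _ in range(nsteps):
        if not backward:
            E = np.fft.ifft(np.fft.fft(E) * D)               # dispersion step
            E = E * np.exp(s * 1j * gamma * np.abs(E)**2 * dz)  # Kerr phase step
        else:  # exact inverse: undo Kerr phase first, then dispersion
            E = E * np.exp(s * 1j * gamma * np.abs(E)**2 * dz)
            E = np.fft.ifft(np.fft.fft(E) * D)
    return E

# Round trip: propagate a Gaussian pulse, then back-propagate the result.
t = np.linspace(-10, 10, 1024, endpoint=False)
dt = t[1] - t[0]
tx = np.exp(-t**2).astype(complex)
rx = split_step(tx, beta2=-1.0, gamma=1.3, length=2.0, nsteps=400, dt=dt)
rec = split_step(rx, beta2=-1.0, gamma=1.3, length=2.0, nsteps=400, dt=dt,
                 backward=True)  # rec matches tx up to FFT roundoff
```

In this noise-free setting the backward pass recovers the transmitted pulse essentially exactly; with amplification noise, loss, and polarization effects added, the inversion is only approximate, which is the limitation noted above.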

Further, constellation shaping increases transmission data rates [34, 35] by choosing a modulation format design with non-equidistant or non-equiprobable symbols, known as geometric and probabilistic shaping, respectively. Based on insight from fiber channel models, constellation shaping can also be used to mitigate signal-dependent nonlinear effects [34].
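As a small illustration of the probabilistic variant, a Maxwell-Boltzmann distribution can be placed over a standard 16-QAM grid: high-energy symbols become less likely, so the shaped constellation needs less average power at a slightly reduced entropy. The shaping parameter `nu` below is an arbitrary choice, and this is not the method pursued in this thesis, which learns geometric shapes end-to-end.

```python
import numpy as np

# Equiprobable 16-QAM grid.
levels = np.array([-3.0, -1.0, 1.0, 3.0])
const = np.array([a + 1j * b for a in levels for b in levels])

# Maxwell-Boltzmann symbol probabilities: p(x) proportional to exp(-nu*|x|^2).
nu = 0.06                      # shaping parameter (illustrative value)
p = np.exp(-nu * np.abs(const) ** 2)
p /= p.sum()

uniform_power = np.mean(np.abs(const) ** 2)    # equiprobable average power
shaped_power = np.sum(p * np.abs(const) ** 2)  # lower than uniform_power
entropy = -np.sum(p * np.log2(p))              # slightly below 4 bit/symbol
```

Sweeping `nu` trades entropy (rate) against average power; `nu = 0` recovers the equiprobable constellation.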

By adding space as another multiplexing dimension [36, 37], space-division multiplexing (SDM) addresses the scaling problem and increases the number of parallel channels of optical communication systems. However, SDM does not address the nonlinear Shannon limit directly, since within each spatial path the limit still applies. It is realized with multi-core fibers, multi-mode fibers, or a combination of the two. Besides more channels, SDM with joint DSP could lead to cost and energy savings, yet new infrastructure is required, since new fiber types must be installed. For future systems with WDM signals on all spatial paths, resource allocation will be challenging given the growing number of interdependent parameters and various complex constraints. However, with the latest record-breaking experiments [38–40], SDM is an obvious choice for large-capacity future systems.

The nonlinear Fourier transform (NFT) has received significant interest from the optical fiber research community as a method to include fiber nonlinearities as part of the transmission. With the NFT, solutions to the NLSE can be found which theoretically overcome the performance limitations leading to the nonlinear Shannon limit. It describes a mapping between the time-domain signal and the nonlinear spectrum. Information encoded in nonlinear spectra transforms linearly during transmission over a noise-free and lossless link. However, realistic channels include losses and noise, and their joint perturbation leads to non-trivial distortions in the nonlinear spectrum, making data recovery challenging. Further research is required for the NFT to become a viable transmission method, including new DSP algorithms and analysis of the perturbation.

The different directions of addressing the nonlinear Shannon limit require advanced fiber models, new DSP algorithms and complex optimization methods for resource allocation. Machine learning methods promise to resolve emerging problems within these research directions. With fiber optic communication systems growing in size and complexity, the set of optimizable parameters, including modulation formats, optical launch power levels and flexible channel spacing, becomes multi-dimensional and interdependent. Here, machine learning aided optimization is capable of exploring solutions efficiently while considering constraints.

For the physical layer, machine learning methods report results for channel equalization, quality of transmission estimation, optical amplification control, optical performance monitoring, modulation format recognition and nonlinear mitigation techniques [41, 42]. Most proposed approaches report results based on learning to predict outcomes from historical data, known as supervised learning. Receiver-side channel equalization for intensity modulation/direct detection (IM/DD) systems has been shown with ordinary and convolutional neural networks [43–45] and Gaussian mixture models [46]. For coherent systems, similar methods are reported with a neural network [47], a variant called extreme learning [48], the k-nearest neighbor algorithm [49], and the support vector machine algorithm [50]. Machine learning methods have been applied to performance monitoring, where properties such as the optical signal-to-noise ratio (OSNR), chromatic dispersion and polarization mode dispersion are learned with a neural network from various signal properties [51–57].

More advanced methods combine machine learning with domain-specific knowledge from optical communication systems. In [58], a time-domain DBP algorithm is reduced in complexity by pruning the number of filter taps within each step, and thereafter improved values of the filter taps are obtained through training, similar to the weights of a deep neural network, where a layer corresponds to a step of the DBP algorithm. In [59], a similar approach is taken, where a factor graph encodes the architecture of a communication system. Eriksson et al. [60] show that one should not apply neural networks and blindly trust the results, since they are capable of learning a pseudorandom binary sequence (PRBS) and hence fabricating performance gain. For a more extensive review of the wider application range of machine learning in optical communication systems, we refer to two survey articles [41, 42].

1.1 Motivation and Outline of Contributions

In recent years, machine learning has led to record-breaking results in cognitive tasks previously reserved for humans, such that many scientists and engineers have adopted its methods and applied them widely to their fields and real-world problems. Although machine learning techniques such as adaptive and Bayesian filtering and probabilistic modeling are not novel in the field of telecommunications, new techniques have emerged, and in particular with increasing computing power, large-scale problems have become addressable.

In Chapter 3, machine learning methods are introduced with examples specific to telecommunication applications. The reviewed methods are combined with the presented fiber channel models as follows:

Prediction of Fiber Channel Properties

Accurate fiber channel models require exhaustive numerical computations. In the study of Chapter 4, the computational demand of a fiber channel model is decreased by introducing a machine learning algorithm. Computationally heavy intermediate channel modeling steps are learned from a sparse dataset. Thereafter, the trained learning algorithm provides predictions for the computation-intensive steps. With a machine learning aided model and its reduced processing requirement, real-time analysis becomes feasible.

Neural Network Receiver for NFDM Systems

The NFT is derived under the theoretical assumption of a noise-free and lossless channel. In Chapter 5, a new approach for detecting NFT-derived signals is proposed, which takes the distortions occurring in a realistic channel into account. The study shows that the statistics of the accumulated distortions of the channel are not additive Gaussian, and the performance of the machine learning aided receiver presents a performance benchmark for future NFT receiver algorithms.

Geometric Constellation Shaping via End-to-end Learning

A constellation shape for the nonlinear optical fiber channel must take amplification noise and signal-dependent nonlinear effects into account. In Chapter 6, a fiber channel model is embedded within a machine learning algorithm and optimized end-to-end, such that the learning algorithm captures the statistical properties of the channel. The learned constellation shape is jointly robust to both channel impairments. The proposed method demonstrates how machine learning can be combined with domain-specific knowledge from optical communications.


1.2 Structure of the Thesis

The remaining six chapters are organized as follows:

Chapter 2 presents the fundamental building blocks of a coherent optical communication system. Transmitter and receiver concepts are discussed, as well as the additive white Gaussian noise (AWGN) channel and how the propagation of WDM signals through optical fiber is modeled, where four different approaches are presented and compared in performance. Further, performance metrics, used in later chapters for assessing optical communication systems, are discussed. The last two sections introduce more specific topics, NFT-based transmission and constellation shaping, which are combined with machine learning methods as part of this work's contributions.

Chapter 3 provides an introduction to machine learning, with a focus on neural networks. Different concepts of machine learning are discussed before the components, structure, training and testing of a neural network are explained. Two different applications of supervised learning are illustrated with examples using neural networks. Unsupervised learning methods, principal component analysis (PCA) and the autoencoder (AE), are discussed, and the last section discusses the implications of deep learning for this work.

Chapter 4 presents a two-step machine learning framework. First, the dimensionality of a dataset is reduced, which facilitates the second step – learning to interpolate data samples. The framework is applied to a dataset which reflects the relationship between physical layer parameters and computationally expensive properties of a fiber channel model. The interpolation of the data removes the computationally expensive step within the fiber model.

Chapter 5 proposes a machine learning based receiver for nonlinear frequency-division multiplexing (NFDM) signals. Although chromatic dispersion and nonlinear effects are mutually countered within NFT-based transmission, amplification noise and the fiber loss introduce non-trivial distortions. Simulation studies are conducted, and it is shown that a neural network based receiver is capable of handling the distortions and detecting the received signals with improved performance compared to the standard NFT-based receiver.

Chapter 6 uses an unsupervised learning method in combination with optical fiber channel models to learn a geometric constellation shape. Further, the applied method is generalized to simultaneously optimize for the optimal launch power. The learned geometrically shaped constellations are used in simulation and experimental studies. It is shown that the machine learning algorithm captures the impairments present in the channel model and that the learned constellation shapes outperform state-of-the-art geometric constellation shapes.

Chapter 7 summarizes the results of this work and presents future directions and open challenges of the presented methods.


CHAPTER 2
Coherent communication systems

In this chapter, the building blocks of a coherent optical communication system are introduced. The main concept is to transmit bits of information from point A to point B through an optical fiber. Light from a laser source is externally modulated by an electrical signal driving a modulator and thereafter transmitted over the fiber channel. During transmission, the signal is amplified multiple times by erbium-doped fiber amplifiers (EDFAs) or Raman amplifiers to compensate for the fiber loss. The fiber channel is dispersive, and power-dependent Kerr nonlinearities occur. At the receiving end, the optical signal is mapped into the electrical domain via coherent detection. The digital bit sequence is extracted from the electrical signal with a set of signal recovery algorithms. Typically, forward error correction (FEC) is employed at the transmitter such that errors can be corrected in the received bit sequence. Compared to intensity modulation/direct detection (IM/DD) systems, where only the amplitude is modulated, coherent systems allow higher spectral efficiency.

Section 2.1 describes how the bits are converted into an optical waveform carrying the information. In Section 2.2, the additive white Gaussian noise (AWGN) channel model is explained. In Section 2.3, the optical fiber channel is described along with its various simulation methods, ranging from computationally heavy numerical methods to derived mathematical models. Section 2.4 describes the necessary steps for extracting the information from the received and distorted waveform. Section 2.5 describes metrics assessing the performance of coherent communication systems. Section 2.6 briefly introduces nonlinear Fourier transform (NFT) based transmission, and Section 2.7 explains constellation shaping.

[Figure 2.1 shows a block diagram: transmitter (FEC encoding, bit-to-symbol mapping, pulse-shaping, laser, modulation), optical channel, and receiver (coherent frontend, DSP, FEC decoding), with electrical signals at both ends and the optical signal in between.]

Figure 2.1: Example of an optical fiber communication system.

2.1 The Transmitter

This section describes the transmitter components of a coherent optical communication system.

2.1.1 Lasers

Before transmitting a data signal over an optical fiber, an optical carrier is required, ideally a continuous lightwave with constant amplitude and frequency, such that the data signal is not distorted by its carrier [61]. A device generating an ideal light source does not exist. However, stable lasers with linewidths in the MHz region meet most demands of an optical carrier. These lasers are used both for directly modulated systems and for systems with external modulation.

Direct modulation typically controls the electrical current driving the laser, such that the laser emits the pattern of the data signal. Such a modulation format is called on-off keying or, more generally, amplitude-shift keying. Direct modulation is attractive for short-reach applications, due to the small hardware size and low power consumption. For long transmission distances, external modulators are used. They enable faster and more sensitive coherent transmission, where both the amplitude and phase of the carrier are modulated.

The spectrum of a laser is not a single spectral line but has a broader spectral linewidth. The combined linewidth of the lasers at the transmitter and receiver determines the amount of phase noise added to the carried signal. Phase noise is typically described as a Wiener process [62],

ψ[k + 1] = ψ[k] + n[k], (2.1)

where ψ[k] is the phase noise at time k and n[k] ∼ N(0, σ²_ψ) is an independent and identically distributed Gaussian random variable. The variance of the Gaussian variable is determined by:

σ²_ψ = 2πΔf T_s, (2.2)

where T_s is the symbol period and Δf is the sum of the linewidths of the carrier and local oscillator lasers. After reception, the carrier phase noise must be estimated to recover the transmitted signal. Section 2.4.2 on the digital signal processing (DSP) chain includes a discussion of carrier phase estimation, which compensates for phase noise.
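The Wiener process of Eqs. (2.1)–(2.2) is straightforward to simulate; the following sketch (function name and parameter choices are ours) accumulates i.i.d. Gaussian phase increments:

```python
import numpy as np

def laser_phase_noise(n_symbols, linewidth, symbol_rate, rng=None):
    """Wiener-process laser phase noise, Eqs. (2.1)-(2.2).

    linewidth: combined Tx + LO linewidth Delta-f [Hz].
    """
    rng = np.random.default_rng() if rng is None else rng
    Ts = 1.0 / symbol_rate
    sigma2 = 2 * np.pi * linewidth * Ts           # Eq. (2.2)
    steps = rng.normal(0.0, np.sqrt(sigma2), n_symbols)
    return np.cumsum(steps)                       # psi[k+1] = psi[k] + n[k]

# e.g. 100 kHz combined linewidth at 32 GBd
psi = laser_phase_noise(10_000, 100e3, 32e9)
```

Applying `np.exp(1j * psi)` to a symbol sequence then imposes the phase noise on the carrier.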

2.1.2 Modulation Formats

A modulation format is a finite set of complex-valued symbols, S = {s_m, m = 1, ..., M}, with M = 2^b, where b is the number of bits per symbol. In this work, M-ary quadrature amplitude modulation (QAM) and iterative polar modulation (IPM)-based geometrically shaped constellations are used [63, 64]. In Chapter 3, a real-valued modulation format, binary phase shift keying (BPSK), is used within an example. Constellation shaping is discussed in Section 2.7.
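A unit-energy square M-QAM constellation, as used throughout this work, can be generated as follows (a minimal sketch; the normalization to average power one is our convention):

```python
import numpy as np

def square_qam(M):
    """Unit-energy square M-QAM constellation, M = 2**b with even b."""
    m = int(np.sqrt(M))
    assert m * m == M, "square QAM needs M to be a perfect square"
    levels = np.arange(-(m - 1), m, 2)            # e.g. [-3,-1,1,3] for 16-QAM
    const = (levels[:, None] + 1j * levels[None, :]).ravel()
    # normalize so that E[|s|^2] = 1
    return const / np.sqrt(np.mean(np.abs(const) ** 2))

s16 = square_qam(16)   # 16 points, average power 1
```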

2.1.3 Electrical Signal Generation

After applying FEC to the bit sequence, the bits are mapped to symbols of the chosen modulation format. The independent and identically distributed symbol sequence is pulse-shaped into an upsampled representation holding the same information [63],

u_E(t) = Σ_{k=−∞}^{∞} s_k g(t − kT_s), (2.3)


where s_k is the symbol at time kT_s and g(t) is the pulse-shape. The bandwidth of the obtained signal is controlled with the pulse-shaping filter. Raised cosine and root-raised cosine pulse-shapes can be used for transmission in optical communication. They enable intersymbol interference-free transmission and control over the spectral edge through the roll-off factor of the filter [63]. Low roll-off factors lead to sharp spectral edges and allow densely packed wavelength-division multiplexing (WDM) channels [65]. More elaborate pulse-shapes with increased nonlinearity tolerance have been proposed for optical communication [66].

Further, signal processing in the electrical domain enables chromatic dispersion pre-compensation [67] and nonlinear mitigation techniques [68, 69]. The limited resolution of digital-to-analog converters (DACs) introduces quantization noise [70], and the frequency response of the transmitter [71] leads to a non-ideal signal.

2.1.4 The Optical Modulator

In coherent systems, the in-phase (I) and quadrature (Q) components of the optical signal carry information, which is equivalent to modulating amplitude and phase [72, 73]. An I/Q modulator is comprised of two Mach-Zehnder modulators (MZMs). As depicted in Fig. 2.2, the light of a laser source is split into two arms, each holding a MZM. MZMs are amplitude modulators and only modulate one dimension. With an additional π/2 phase shift in one of the arms, the recombined signals are orthogonal, and thereby two dimensions are modulated. For dual-polarization transmission, the described scheme is duplicated for both the X and Y polarization. Under ideal conditions, the output of the modulator is a dual-polarized optical signal in passband, U(0, t) = [u_x(0, t), u_y(0, t)]^T, where both polarizations hold a signal as in (2.3).

[Figure 2.2 shows an optical I/Q modulator: the laser light is split into two arms, each holding a Mach-Zehnder modulator driven by the I and Q components, with a π/2 phase shift in one arm before recombination.]

Figure 2.2: Optical I/Q modulator.

However, a MZM has a sinusoidal transfer function, which is addressed either by operating it in a linear region or by pre-compensating it electrically [71]. The former leads to a lower optical signal-to-noise ratio (OSNR) at the transmitter.

2.2 The Additive White Gaussian Noise Channel

The AWGN channel is a powerful channel model. It adds white Gaussian noise to the transmitted symbols. The channel is given by [63]:

r[k] = s[k] + n[k], (2.4)

where s[k] is a symbol drawn from the constellation of a modulation format, r[k] is the observation and n[k] ∼ N(0, σ²_AWGN) is the noise sample, all at time step k. The AWGN channel presented here operates on symbol level; for a generalization to a system with a continuous waveform, the variance of the noise source must be adjusted according to the system bandwidth due to the higher sampling rate.

A fiber optic link can be modeled by the AWGN channel if nonlinear effects are neglected and the variance σ²_AWGN is determined by the amplification noise accumulated throughout the whole link from multiple EDFAs. For the AWGN channel model, the optimal shaping gain and the optimal decision regions for the maximum likelihood receiver are known.
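A symbol-level AWGN channel as in Eq. (2.4) can be sketched as below. Eq. (2.4) is written per real dimension; for complex constellation symbols we use the circularly symmetric version, splitting the total noise variance equally over the I and Q components (an assumption of this sketch):

```python
import numpy as np

def awgn_channel(symbols, snr_db, rng=None):
    """Eq. (2.4): r[k] = s[k] + n[k], circularly symmetric complex noise.

    snr_db is Es/N0, assuming unit average symbol energy.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma2 = 10 ** (-snr_db / 10)                 # total noise variance
    n = np.sqrt(sigma2 / 2) * (rng.normal(size=symbols.shape)
                               + 1j * rng.normal(size=symbols.shape))
    return symbols + n
```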

2.3 The Optical Fiber Channel

The optical fiber channel is a nonlinear channel. In Fig. 2.3, a typical transmission setup composed of N_Sp fiber spans is shown. Within a span, the transmitted waveform is exposed to loss, chromatic dispersion and nonlinear effects. The nonlinear Schrödinger equation (NLSE) models these effects. The NLSE is a differential equation describing how light propagates through a standard single-mode fiber (SSMF) and how it evolves with respect to space z and time t. The light pulse envelope A(z, t) satisfies the NLSE as follows [15]:

dA/dz = −(α/2)A − i(β₂/2)(d²A/dt²) + iγ|A|²A, (2.5)


[Figure 2.3 shows a block diagram: transmitter, fiber optic channel consisting of N_Sp spans of fiber each followed by an EDFA, and receiver, with digital information at both ends.]

Figure 2.3: Example of an optical fiber communication system with EDFA amplification.

where α and β₂ account for fiber loss and second-order chromatic dispersion, respectively. Third-order chromatic dispersion effects are neglected. The nonlinear coefficient γ = 2πn₂/(λ_c A_eff) is defined by the nonlinear-index coefficient n₂, the center wavelength λ_c and the effective core area A_eff of the fiber. Chromatic dispersion gives rise to pulse broadening in the time domain within a single channel and walk-off across WDM channels, both due to the different propagation speeds of light at different wavelengths. The Kerr effect introduces nonlinearities modeled by the nonlinear term, such as self-phase modulation (SPM), cross-phase modulation (XPM) and four-wave mixing (FWM).

For dual-polarized lightwaves, the NLSE is simplified by the Manakov equations, if the dispersion and nonlinear lengths are much larger than the birefringence correlation length [74]:

dA_x/dz = −(α/2)A_x − i(β₂/2)(d²A_x/dt²) + i(8/9)γ(|A_x|² + |A_y|²)A_x, (2.6a)

dA_y/dz = −(α/2)A_y − i(β₂/2)(d²A_y/dt²) + i(8/9)γ(|A_x|² + |A_y|²)A_y, (2.6b)

where A_x(z, t) and A_y(z, t), or as a vector A(z, t), are the pulse envelopes of the two orthogonal light polarizations. The random birefringence along the fiber leads to many random unitary rotations of the two polarizations. The factor 8/9 accounts for averaging the interplay between the nonlinearities and the continuously rotating polarization fields over long fiber lengths. Polarization mode dispersion is not considered in (2.6).

Typical parameters for such systems are given in Table 2.1, where D = −2πcβ₂/λ_c². The |·|² terms in (2.5) and (2.6) reflect the dependence of the optical signal on its own power. At high power, nonlinear effects are predominant and become performance limiting if unmanaged.

At the end of each span, the optical signal is amplified to compensate for the fiber loss, introducing amplification noise from an EDFA. The noise arises from amplified spontaneous emission (ASE) in the amplifier and is typically modeled with an AWGN noise source with variance σ²_ASE [22]:

n_sp = (G · NF − 1) / (2 · (G − 1)), (2.7a)

σ²_ASE = (G − 1) · n_sp · (h c / λ_c) · B, (2.7b)

where n_sp is the spontaneous-emission factor, G is the amplification gain, NF is the EDFA noise figure, h is the Planck constant, c is the speed of light, λ_c is the signal center wavelength and B is the bandwidth of the system. Other amplification methods, such as Raman amplification, are not considered in this work.

Table 2.1: Parameters of SSMF.

  Fiber loss α:             0.2 dB/km
  Dispersion D:             17 ps/(nm km)
  Nonlinear coefficient γ:  1.3 (W km)⁻¹
  Carrier wavelength λ_c:   1550 nm
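Eqs. (2.7a)–(2.7b) translate directly into code. The sketch below assumes the EDFA gain exactly compensates the span loss (a common simplification used throughout this chapter); the example numbers are illustrative:

```python
import numpy as np

h = 6.62607015e-34        # Planck constant [J s]
c = 299_792_458.0         # speed of light [m/s]

def ase_variance(span_length_km, alpha_db_km, nf_db, wavelength, bandwidth):
    """Per-span EDFA ASE noise variance, Eqs. (2.7a)-(2.7b).

    Assumes the gain G exactly compensates the span loss.
    """
    G = 10 ** (alpha_db_km * span_length_km / 10)      # linear gain
    NF = 10 ** (nf_db / 10)                            # linear noise figure
    nsp = (G * NF - 1) / (2 * (G - 1))                 # Eq. (2.7a)
    return (G - 1) * nsp * (h * c / wavelength) * bandwidth  # Eq. (2.7b)

# e.g. 80 km span, 0.2 dB/km, 5 dB noise figure, 1550 nm, 32 GHz bandwidth
sigma2_ase = ase_variance(80, 0.2, 5, 1550e-9, 32e9)
```

For a link of N_Sp identical spans, the accumulated variance is simply N_Sp times the per-span value.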

2.3.1 Split-step Fourier Method

The split-step Fourier method (SSFM) is a numerical method for solving nonlinear partial differential equations such as the NLSE and the Manakov equations. The SSFM models the propagation of light through the fiber in many consecutive steps. Typically, each step switches from the time domain to the frequency domain and back. In this way, the linear part of the NLSE is solved in the Fourier domain and the nonlinear part in the time domain [75]. Splitting the NLSE into linear and nonlinear parts introduces a numerical error. However, the error is controllable by decreasing the step size, which can be adjusted depending on the desired accuracy. The linear and nonlinear steps are described as follows:

dA_L/dz = −(α/2)A − i(β₂/2)(d²A/dt²), (2.8a)

dA_N/dz = iγ|A|²A, (2.8b)

where A_L(z, t) is the linear part and A_N(z, t) the nonlinear part. Both have an analytical solution: the linear part in the Fourier domain and the nonlinear part in the time domain. For one global step of length h, half of the linear step is applied, followed by a full nonlinear step, and thereafter the second half of the linear step:

A_L(z + h/2, ω) = exp((−α/2 − i(β₂/2)ω²) · h/2) · A(z, ω), (2.9a)

A_N(z + h, t) = exp(iγ|A_L|² h) · A_L(z + h/2, t), (2.9b)

A(z + h, ω) = exp((−α/2 − i(β₂/2)ω²) · h/2) · A_N(z + h, ω), (2.9c)

where A(z, ω) is the Fourier transform of A(z, t). For a dual-polarized signal, the same method is applied to the Manakov equations. The SSFM models SPM, XPM and FWM accurately, but compared to the other models presented in this section, it is computationally expensive.
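A minimal symmetric SSFM for the scalar NLSE, following Eqs. (2.9a)–(2.9c), can be sketched as below. Naming and sign conventions are ours; note that β₂ follows from Table 2.1 via β₂ = −Dλ_c²/(2πc) ≈ −21.7 ps²/km:

```python
import numpy as np

def ssfm(A, dt, L, dz, alpha, beta2, gamma):
    """Symmetric split-step Fourier method for the scalar NLSE, Eq. (2.9).

    A:      field envelope samples A(0, t)
    dt:     sample spacing [s];  L, dz: fiber length and step size [m]
    alpha:  loss [1/m]; beta2: GVD [s^2/m]; gamma: nonlinearity [1/(W m)]
    """
    omega = 2 * np.pi * np.fft.fftfreq(len(A), d=dt)
    # half linear step operator, Eq. (2.9a)/(2.9c)
    half_lin = np.exp((-alpha / 2 - 1j * beta2 / 2 * omega ** 2) * dz / 2)
    for _ in range(int(round(L / dz))):
        A = np.fft.ifft(half_lin * np.fft.fft(A))          # half linear step
        A = A * np.exp(1j * gamma * np.abs(A) ** 2 * dz)   # full nonlinear step
        A = np.fft.ifft(half_lin * np.fft.fft(A))          # second half linear step
    return A
```

With α = 0, every operator is a pure phase rotation, so the signal power is conserved, which is a useful sanity check of an implementation.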

2.3.2 Simple XPM Model

Tao et al. [76] have proposed a simplified fiber model for WDM systems. It models XPM for a dispersion managed link with repeated lumped amplification but omits SPM and FWM effects. It is based on the Volterra expansion of the Manakov equations and linearizes the system at the beginning of each span, after the amplifier. It exploits the fact that in a system with lumped amplification, most nonlinearities occur at the beginning of the span. We rewrite the model to accommodate a dispersion unmanaged link.

Consider a WDM system of N_Ch channels and N_Sp spans of the same length L_Sp. The initial optical field is given by:

A(0, t) = Σ_{m=1}^{N_Ch} U^(m)(0, t) e^{iΩ^(m) t}, (2.10)

where Ω^(m) are the channel spacings relative to the channel of interest (COI), and U^(m)(0, t) = [u_x^(m)(0, t), u_y^(m)(0, t)]^T are the complex amplitudes of both polarizations of the WDM channels, where the COI is denoted by m = COI. The model only considers effects that affect the COI by working with the complex amplitudes of the individual channels U^(m)(z, t). In contrast, the SSFM works on the whole optical field A(z, t). Thereby, the model also avoids the large upsampling rates required to represent A(z, t) without violating the sampling theorem. The propagation of the COI within each span is modeled in two steps: an initial nonlinear step, a multiplication by a time-varying Jones matrix, and a consecutive linear step, a chromatic dispersion filter. The nonlinear step is given by:

U_w^COI((n−1)L_Sp, t) = W^(n)(t) · U_wo^COI((n−1)L_Sp, t), (2.11)

where U_wo^COI(z, t) denotes the waveform without XPM distortions and U_w^COI(z, t) denotes the waveform with XPM distortions. The consecutive linear step is given by:

U_wo^COI(nL_Sp, ω) = exp(−i(β₂/2)ω² L_Sp) · U_w^COI((n−1)L_Sp, ω), (2.12)

where U^(m)(z, ω) is the Fourier transform of U^(m)(z, t). The linear step models the chromatic dispersion of one span. The result of the linear step is used for the nonlinear step of the next span, and the process is repeated until the end of the link. The loss of the link is assumed to be ideally compensated by EDFAs, such that attenuation and amplification can be neglected, as the global step always ends right after the amplifier, where the amplification noise is also added.

The Jones matrix holds the contributions of the interfering channels occurring in span n. The propagation of the interfering channels neglects nonlinear effects and exclusively models chromatic dispersion, including the walk-off, modeled by a linear phase shift:

U^(m)(z, ω) = exp(−iβ₂ z ((1/2)ω² + Ω^(m) ω)) · U^(m)(0, ω). (2.13)

The interfering channels populate the Jones matrix as follows:

W^(n)(t) = e^{iφ^(n)(t)} ·
  [ √(1 − |w_xy^(n)(t)|²) e^{iΔφ^(n)(t)}    w_yx^(n)(t)
    w_xy^(n)(t)    √(1 − |w_yx^(n)(t)|²) e^{−iΔφ^(n)(t)} ], (2.14)

where φ^(n)(t) = (φ_x^(n)(t) + φ_y^(n)(t))/2 and Δφ^(n)(t) = (φ_x^(n)(t) − φ_y^(n)(t))/2 are terms of the XPM-induced phase noise, and w_xy^(n)(t), w_yx^(n)(t) are terms of the XPM-induced polarization cross-talk:

w_yx^(n)(t) = Σ_{m∈M} i u_x^(m)((n−1)L_Sp, t) u_y^{*(m)}((n−1)L_Sp, t) ⊗ h^(m)(t), (2.15a)

w_xy^(n)(t) = Σ_{m∈M} i u_y^(m)((n−1)L_Sp, t) u_x^{*(m)}((n−1)L_Sp, t) ⊗ h^(m)(t), (2.15b)

φ_x^(n)(t) = Σ_{m∈M} (2|u_x^(m)((n−1)L_Sp, t)|² + |u_y^(m)((n−1)L_Sp, t)|²) ⊗ h^(m)(t), (2.15c)

φ_y^(n)(t) = Σ_{m∈M} (2|u_y^(m)((n−1)L_Sp, t)|² + |u_x^(m)((n−1)L_Sp, t)|²) ⊗ h^(m)(t), (2.15d)

H^(m)(ω) = (8γ/9) · (1 − exp(−αL_Sp + iΔβ_1^(m) ω L_Sp)) / (α − iΔβ_1^(m) ω), (2.15e)

where ⊗ is the convolution operator, the summations skip the COI with M = {1, ..., N_Ch} \ {COI}, and h^(m)(t) is the inverse Fourier transform of the transfer function H^(m)(ω). The fiber parameters α, γ and Δβ_1^(m) = β_1^(m) − β_1^COI represent the fiber loss, the nonlinear coefficient and the differential group velocity, respectively. The derivation of (2.15), in particular of the nonlinear transfer function H^(m)(ω), can be found in [76].

The Jones matrix for the n-th span, (2.14), holds the contributions of the XPM-induced phase noise and the XPM-induced polarization cross-talk from the other channels occurring in the n-th span. For this, the complex amplitudes of the interfering channels are propagated up to the beginning of span n, (2.13). Thereafter, the interfering channels undergo multiplications and modulus operations as in the Manakov equations, (2.15a)-(2.15d), before they are filtered by a transfer function, (2.15e), which is derived from the linearization with the Volterra expansion [76]. The contributions of all interfering channels are summed up and populate the Jones matrix, (2.14) and (2.15).

For the dispersion managed link, as in [76], the dispersion operations in (2.12) and (2.13) become unnecessary. An additional summation over all spans in (2.15a)-(2.15d) results in a single time-varying Jones matrix modeling the whole link.
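The per-span two-step propagation of the COI, Eqs. (2.11)–(2.12), can be sketched as a loop over spans. This is a skeleton under our own naming; it assumes the time-varying Jones matrices W^(n)(t) of Eqs. (2.14)–(2.15) have been precomputed from the interfering channels:

```python
import numpy as np

def propagate_coi(u_coi, jones_per_span, beta2, span_length, dt):
    """Two-step per-span COI propagation of the simplified XPM model.

    u_coi:          2 x N array [u_x; u_y] of the channel of interest
    jones_per_span: list of 2 x 2 x N time-varying Jones matrices W^(n)(t),
                    built from Eqs. (2.14)-(2.15) (assumed precomputed)
    """
    omega = 2 * np.pi * np.fft.fftfreq(u_coi.shape[1], d=dt)
    cd = np.exp(-1j * beta2 / 2 * omega ** 2 * span_length)  # Eq. (2.12)
    for W in jones_per_span:
        # nonlinear step: sample-wise 2x2 matrix-vector product, Eq. (2.11)
        u_coi = np.einsum("ijt,jt->it", W, u_coi)
        # linear step: chromatic dispersion of one span
        u_coi = np.fft.ifft(cd * np.fft.fft(u_coi, axis=1), axis=1)
    return u_coi
```

With identity Jones matrices the loop reduces to pure chromatic dispersion, which is an all-pass operation and therefore preserves the signal power, a convenient sanity check.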


2.3.3 Gaussian Noise Model

The Gaussian noise (GN)-model was proposed by Poggiolini et al. [26], with similar results existing since 1993 [77]. The proposed model assumes that the nonlinear effects degrading the transmitted signal can be modeled as additive Gaussian noise. The assumption holds due to the central limit theorem, which states that the sum of many independent random variables leads to a single Gaussian distributed random variable. During transmission, due to chromatic dispersion, the random data of the signal increasingly overlap in time, while the signal continuously interferes with itself. Hence, samples of the overlapping signal are modeled by a single Gaussian random variable. (Note that the signal is only point-wise Gaussian; joint Gaussianity of the signal samples as a whole is not given. Section 2.3.4 describes a more accurate model taking this property into account, as does the extended Gaussian noise (EGN)-model [30].)

The GN-model, derived through the Manakov equations, leads to a physical explanation of the power spectral density (PSD) of the nonlinear interference (NLI), based on the nonlinear FWM effect. Consider a hypothetical WDM system with a PSD comprised of a finite number of discrete frequency tones, as depicted in Fig. 2.4 (top). If all frequency tones in this WDM system interact in a FWM manner, such that a triplet of these tones acts as pumps for a fourth new tone, the fourth tone can be interpreted as a minuscule part of the NLI, as in Fig. 2.4 (bottom). Adding up all new frequency tones produced by every possible combination of pump triplets yields a new PSD manifesting the whole NLI. Increasing the density of the discrete frequency tones of the initial WDM system, such that their spacing becomes infinitesimally small, leads to the full PSD of the NLI. For a link with identical spans and loss perfectly compensated by EDFAs, the full PSD of the NLI is given by [26]:

G_NLI(f) = (16/27) γ² L_eff² · ∫_{−∞}^{∞} ∫_{−∞}^{∞} G_WDM(f₁) G_WDM(f₂) G_WDM(f₁ + f₂ − f) · ρ(f₁, f₂, f) · χ(f₁, f₂, f) df₂ df₁, (2.16)

where γ is the nonlinear coefficient, Leff is the effective fiber length, GWDM(f)is the PSD of the WDM signal, ρ accounts for the non-degenerate FWM effi-ciency and χ is the phased-array factor. A frequency tone triplet is describedby f1, f2, f1 + f2 − f, which pump the fourth tone at f . Descriptions of Leff,ρ and χ are given in [26] equations (4-8).



Figure 2.4: Schematics of the GN-model. Top: power spectrum of a hypothetical WDM system of five channels, represented by discrete frequency tones, with a frequency tone triplet marked. Bottom: the resulting NLI power spectral density and the receiver bandpass filter.

Only the NLI within the spectral band of the COI after the receiver filter affects the COI. A single evaluation of G_NLI(f) is required, under the assumption that the NLI is white within the spectral band of the COI. The variance of the NLI is then given by:

$$\sigma^2_{\mathrm{GN}} = R_s \cdot G_{\mathrm{NLI}}(f_{\mathrm{COI}}), \tag{2.17}$$

where R_s is the baud rate of the COI and f_COI is the center frequency of the COI.
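Once the double integral in (2.16) has been evaluated at the COI center frequency, (2.17) reduces to a single multiplication. A minimal numeric sketch, with hypothetical values for the baud rate and the NLI PSD:

```python
# Sketch of (2.17): NLI variance under the assumption that the NLI is
# white within the spectral band of the COI.
R_s = 32e9       # baud rate of the COI [Bd] (hypothetical value)
G_nli = 1.0e-27  # NLI PSD evaluated at f_COI [W/Hz] (hypothetical value)

sigma2_gn = R_s * G_nli  # NLI variance
```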

2.3.4 Nonlinear Interference Noise Model

The nonlinear interference noise (NLIN)-model, proposed by Dar et al. [27, 28], is derived from a time-domain perspective given in [78]. It explicitly resolves a shortcoming of the GN-model and adds a term to the variance of the NLI that depends on the modulation format. For a fully Gaussian modulation format, the two models coincide.



The GN-model assumes that the NLI is fully characterized by its PSD, i.e., by statistically independent frequency tones. Dar et al. argue that frequency tones within a WDM channel are uncorrelated but must be statistically dependent, since they represent the same signal. Further, the frequency tones are not jointly Gaussian: a linear transform of a discrete random variable does not become a Gaussian random variable, and the neighbouring intra-channel frequency tones are a linear transformation of samples from the modulation format, modeled by a discrete random variable. If the modulation format is modeled by a Gaussian random variable, the GN-model has no shortcomings, which underlines that the NLI depends on the modulation format.

By taking the statistical dependence among intra-channel frequency tones into account, the NLIN-model is derived. The frequency tones of neighbouring WDM channels are statistically independent, and the contributions of different channels are computed independently. The EGN-model [30] is an extension of the above two models, including additional cross-wavelength interactions, which are only significant in densely packed WDM systems. For the systems investigated in this work, it is sufficient to use the NLIN-model.

The NLIN-model describes the variance of the NLI as follows [28]:

σ2NLIN =

∑m∈M

GN-model︷ ︸︸ ︷P 3

TxX(m)1 +

mod. format dependent︷ ︸︸ ︷P 3

TxX(m)2

(⟨|b(m)|4⟩⟨|b(m)|2⟩2 − 2

), (2.18)

where NCh is the number of WDM channels, the summation skips the COIwith M = 1, ..., NCh\COI, PTx is the launch power of a single channel, withthe assumption that all WDM channels have the same launch power, X(m)

1and X

(m)2 are constants depending on the system, and b(m) are the symbols

of the m-th interfering channel. Expressions for X(m)1 and X

(m)2 are found

in [27] equations (26-27), they are dependent on the number of spans NSp, thespan length LSp, the fiber loss α, the chromatic dispersion β2, the nonlinearcoefficient γ and the channel spacing Ω(m).
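The modulation-format factor in the second term of (2.18) can be evaluated directly from the constellation. A small sketch (constellations and normalizations are illustrative); note that the factor vanishes for a complex Gaussian constellation, consistent with the GN- and NLIN-models coinciding in that case:

```python
import numpy as np

def mod_format_factor(symbols):
    """The factor <|b|^4>/<|b|^2>^2 - 2 from (2.18).
    It is 0 for a complex Gaussian constellation (recovering the
    GN-model) and negative for conventional QAM formats."""
    p2 = np.mean(np.abs(symbols) ** 2)
    p4 = np.mean(np.abs(symbols) ** 4)
    return p4 / p2 ** 2 - 2

# QPSK has constant modulus, so the factor is -1:
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

# 16-QAM (equiprobable symbols) yields -0.68:
levels = np.array([-3.0, -1.0, 1.0, 3.0])
qam16 = np.array([a + 1j * b for a in levels for b in levels])
```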

2.3.5 Comparison

In Fig. 2.5, the presented fiber models are compared. It shows the performance obtained for a 5 WDM channel system over 10 spans of 100 km each, simulated with the presented fiber channel models. All models but the SSFM consider only inter-channel nonlinear effects.

Figure 2.5: Effective SNR with respect to the per-channel launch power for the center channel of a WDM system, simulated with the four presented fiber channel models (SSFM with single-channel DBP, simple XPM-model, NLIN-model and GN-model). A 5 WDM channel system with 50 GHz channel spacing is modeled with fiber properties of a SSMF.

For a fair comparison, at the receiver of the signal propagated with the SSFM, intra-channel nonlinearity compensation is applied through digital back-propagation (DBP). The SSFM, the simple XPM-model and the NLIN-model show almost identical performance. The GN-model overestimates the NLI.

The SSFM and the simple XPM-model yield a waveform at the output of the channel, whereas the GN-model and the NLIN-model operate on symbol level. The SSFM models the transmission of an arbitrary waveform through SSMF. It is the most accurate model, but also requires the most computation time. The simple XPM-model emulates the transmission of a single channel of the WDM system over a dispersion-unmanaged link. The method allows limiting the calculation of interactions to those towards the COI, and acts on the complex amplitudes of the individual WDM signals.

The GN-model simulates the NLI as a memoryless AWGN term dependent on the launch power per channel. The NLIN-model additionally includes modulation-dependent effects. The latter allows a more accurate analysis of non-conventional modulation schemes, such as probabilistically and geometrically shaped constellations, while the GN-model allows studies under an AWGN channel assumption. In this work, equation (2.18) is used to compute the variance of the NLI for both models. The second term, with its dependence on the fourth-order moment µ4, is neglected for the GN-model. The NLI variance also depends on intra-channel nonlinear effects, which are not described in (2.18), but are included in later chapters. The intra-channel nonlinear effects additionally depend on the sixth-order moment µ6 of the constellation. Hence, the NLIN-model variance is described by σ²_NLIN(P_Tx, µ4, µ6), whereas the GN-model variance is described by σ²_GN(P_Tx).

A per-symbol description of the memoryless NLIN-model follows:

$$y[k] = c_{\mathrm{NLIN}}(x[k], P_{\mathrm{Tx}}, \mu_4, \mu_6) \tag{2.19a}$$
$$\phantom{y[k]} = x[k] + n_{\mathrm{ASE}}[k] + n_{\mathrm{NLIN}}[k], \tag{2.19b}$$

where x[k] and y[k] are the transmitted and received symbols at time k, c_NLIN(·) is the channel model, and n_ASE[k] ∼ N(0, σ²_ASE) and n_NLIN[k] ∼ N(0, σ²_NLIN(·)) are Gaussian noise samples with variances σ²_ASE and σ²_NLIN(·), respectively. For the GN-model, σ²_NLIN(·) is replaced with σ²_GN(·) and the channel model depends only on the power, c_GN(x[k], P_Tx). We refer to σ²_NLIN/GN(·) as a function of the optical launch power and the moments of the constellation, since these parameters are optimized in Chapter 6.
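The per-symbol model of (2.19) is straightforward to simulate. A minimal sketch, with hypothetical noise variances standing in for σ²_ASE and σ²_NLIN(P_Tx, µ4, µ6):

```python
import numpy as np

rng = np.random.default_rng(0)

def nlin_channel(x, sigma2_ase, sigma2_nlin):
    """Memoryless channel of (2.19): y[k] = x[k] + n_ASE[k] + n_NLIN[k].
    Both noise terms are circularly symmetric complex Gaussian; the
    variances used here are hypothetical placeholders."""
    def cnoise(s2):
        return rng.normal(0.0, np.sqrt(s2 / 2), x.shape) + \
               1j * rng.normal(0.0, np.sqrt(s2 / 2), x.shape)
    return x + cnoise(sigma2_ase) + cnoise(sigma2_nlin)

qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
x = qpsk[rng.integers(0, 4, 10000)]
y = nlin_channel(x, sigma2_ase=0.01, sigma2_nlin=0.005)
```

Since the two noise terms are independent, the empirical variance of y − x approaches σ²_ASE + σ²_NLIN.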

2.4 The Receiver

2.4.1 Coherent Detection

The coherent front-end of a coherent receiver converts the optical signal back into the electrical domain [8]. The received optical signal is split for dual-polarization reception. Both polarizations are mixed with the local oscillator in a 90° hybrid, removing the carrier from the signal. The resulting signals are detected by balanced photodiodes and digitized by an analog-to-digital converter (ADC) [70]. All modulated signals are recovered: the I and Q components of both polarizations. Thereafter, the signal is digitally processed to recover the data.

2.4.2 Digital Signal Processing

The DSP chain extracts the digital bits from the received electrical signal. The main concepts are explained in the following.



Chromatic dispersion compensation: Before coherent systems, chromatic dispersion was either tolerated, or compensated for optically with inline dispersion compensating fiber [79]. In coherent systems, with both amplitude and phase converted into the electrical domain, chromatic dispersion is elegantly compensated for by a linear filter. The number of filter taps depends on the accumulated chromatic dispersion. Residual chromatic dispersion is taken care of by the adaptive equalizer. [80, 81]

Timing recovery: Samples of the received signal are periodic but shifted in time, such that the optimal sampling position and clock must be recovered. This is the task of timing recovery algorithms. They use a cost function, which is minimized to determine the correct sampling time by exploiting symmetries in the modulation format and the transitions between symbols. [82–84]

Frequency offset compensation: The laser providing the carrier and the local oscillator mix during coherent detection. However, the detection is heterodyne, such that their frequencies are not exactly aligned, leaving the signal with a continuous frequency component. Frequency offset compensation estimates the residual frequency and corrects for its effect. [84]

Adaptive equalizer: All residual linear effects, such as polarization mixing and polarization mode dispersion, are compensated for by the adaptive equalizer. It is initialised in data-aided mode and switches to decision-directed mode after convergence of the filter taps. Schemes with pilot symbols are also possible. The filter taps are continuously updated to deal with non-stationary effects. [9, 85]

Carrier phase noise estimation: As described in Section 2.1.1, the lasers give rise to a Wiener process that interferes with the signal as a phase rotation. Blind phase search algorithms, intertwined with the adaptive equalizer [62, 84, 86], and the Viterbi-Viterbi algorithm [87] can track the phase noise continuously.

Mitigation of nonlinearities: Mitigation techniques for nonlinear effects include DBP, time-domain perturbative nonlinear compensation, frequency-domain Volterra equalizers, time-varying linear equalizers and Kalman filters. [21, 31, 32, 88–90]

Symbol decision and error correction: The symbol is recovered, either through a hard symbol decision or a probabilistic soft symbol decision. Thereafter, hard- or soft-decision error correction decoding algorithms are applied.



2.5 Performance Metrics

This section introduces four metrics for assessing the performance of optical communication systems.

2.5.1 Signal-to-Noise Ratio

The signal-to-noise ratio (SNR) is the ratio between the electrical signal power and the electrical noise power within the bandwidth of the signal. It measures the signal quality in the electrical domain [16],

$$\mathrm{SNR} = \frac{P_s}{P_n}, \tag{2.20}$$

where P_s is the signal power and P_n the noise power. In later chapters, the term effective SNR is used to indicate the received, measured SNR. The effective SNR includes noise contributions from transmitter imperfections, amplification noise and nonlinear effects. In the simulation studies, the transmitter is considered ideal. The effective SNR is given by:

$$\frac{1}{\mathrm{SNR}_{\mathrm{eff}}} = \frac{1}{\mathrm{SNR}_{\mathrm{Tx}}} + \frac{1}{\mathrm{SNR}_{\mathrm{ASE}}} + \frac{1}{\mathrm{SNR}_{\mathrm{NLI}}}, \tag{2.21}$$

where SNR_Tx is the SNR at the transmitter before transmission, SNR_ASE accounts for the noise added by the EDFA amplification stages and SNR_NLI accounts for the NLI.

2.5.2 Optical Signal-to-Noise Ratio

The optical signal-to-noise ratio (OSNR) is the ratio between the optical signal power and the optical noise power within a 12.5 GHz reference bandwidth. The OSNR is in direct relationship to the SNR [16],

$$\mathrm{OSNR}\,[\mathrm{dB}] = \mathrm{SNR}\,[\mathrm{dB}] + 10 \cdot \log_{10}\!\left(\frac{p\, B_e}{2 B_o}\right), \tag{2.22}$$

where p is the number of polarizations, B_e is the electrical signal bandwidth and B_o = 12.5 GHz is the reference bandwidth.
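The conversion of (2.22) is a one-line computation. A sketch with hypothetical signal parameters:

```python
import math

def osnr_db(snr_db, p, b_e, b_o=12.5e9):
    """OSNR from SNR per (2.22), with p polarizations, electrical
    bandwidth b_e, and the 12.5 GHz reference bandwidth b_o."""
    return snr_db + 10.0 * math.log10(p * b_e / (2.0 * b_o))

# Hypothetical dual-polarization signal with B_e = 32 GHz:
val = osnr_db(snr_db=15.0, p=2, b_e=32e9)
```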

2.5.3 Pre-FEC Bit Error Ratio

The bit error ratio (BER) before error correction is simply the number of bit errors after hard symbol decision divided by the total number of bits.
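This definition translates directly into code. A minimal sketch with a toy bit sequence:

```python
import numpy as np

def pre_fec_ber(tx_bits, rx_bits):
    """Bit errors after hard decision divided by the total number of bits."""
    tx = np.asarray(tx_bits)
    rx = np.asarray(rx_bits)
    return float(np.mean(tx != rx))

ber = pre_fec_ber([0, 1, 1, 0, 1, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0, 1, 1])  # 2 errors out of 8 bits
```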

Page 51: Machine Learning Methods in Coherent Optical Communication Systems · Machine Learning Methods in Coherent Optical Communication Systems Jones, Rasmus Thomas Publication date: 2019

26 2 Coherent communication systems

2.5.4 Mutual Information

The mutual information (MI) quantifies the statistical relationship between the channel input and the received symbols [91]:

$$I(X;Y) = \mathbb{E}\left[\log_2 \frac{p_{Y|X}(y|x)}{p_Y(y)}\right], \tag{2.23}$$

where x and y are the transmitted and received symbols, p_{Y|X}(y|x) is the conditional probability density function of the channel output given the channel input and p_Y(y) is the marginal probability density function of the channel output.

The MI provides the rate achievable by the best receiver. In fiber optic systems, the conditional probability density function of the channel output is not known; therefore an optimal receiver cannot be determined. Hence, for its estimation, an auxiliary channel model assumption is required [35, 92], which most often is Gaussian. Due to this assumption, the resulting estimate is always a lower bound on the MI. For a given modulation format, the MI gives the maximum achievable rate independent of the bit-to-symbol mapping, whereas the generalized mutual information (GMI) gives the maximum achievable rate for a given format with a given bit-to-symbol mapping [93].
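The Monte Carlo estimation of this lower bound with a Gaussian auxiliary channel can be sketched as follows (constellation, noise level and sample count are illustrative; the normalization constants of the auxiliary density cancel in the ratio):

```python
import numpy as np

rng = np.random.default_rng(1)

def mi_lower_bound(const, x_idx, y, sigma2):
    """Monte Carlo estimate of a lower bound on the MI of (2.23),
    using a circularly symmetric Gaussian auxiliary channel q(y|x)
    with variance sigma2 and equiprobable constellation points."""
    d2 = np.abs(y[:, None] - const[None, :]) ** 2  # squared distances
    q = np.exp(-d2 / sigma2)                       # unnormalized q(y|x')
    num = q[np.arange(len(y)), x_idx]              # q(y | transmitted x)
    den = q.mean(axis=1)                           # sum_x' P(x') q(y|x')
    return np.mean(np.log2(num / den))

# QPSK over a Gaussian channel at high SNR; the estimate approaches
# the 2 bit/symbol entropy of the constellation:
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
idx = rng.integers(0, 4, 20000)
sigma2 = 0.05
noise = rng.normal(0, np.sqrt(sigma2 / 2), 20000) + \
        1j * rng.normal(0, np.sqrt(sigma2 / 2), 20000)
mi = mi_lower_bound(qpsk, idx, qpsk[idx] + noise, sigma2)
```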

2.6 Nonlinear Fourier Transform Based Communication Systems

The NFT is a novel method of addressing the capacity-limiting Kerr nonlinearities in optical communication systems [94]. It incorporates nonlinearities as an element of the transmission by exploiting the integrability of the lossless nonlinear Schrödinger equation and, for dual polarization, of the lossless Manakov equations. The NFT associates a nonlinear spectrum, consisting of a continuous and a discrete part, to a signal in the time domain [95–97]. A linear transformation describes the evolution of the spectrum upon spatial propagation in the nonlinear fiber channel. At the receiver, the inverse linear transformation allows recovering the data encoded in the nonlinear spectrum. Encoding the data in the nonlinear spectrum is known as nonlinear frequency-division multiplexing (NFDM) [98]. Transmission using the NFT has been experimentally demonstrated for single and dual polarization [99–104].



Figure 2.6: Discrete eigenvalues, λ1 = j0.3 and λ2 = j0.6, and scattering coefficients of both polarizations b1,2(λi), i = 1, 2, shown in the complex plane for the X- and Y-polarization. The scattering coefficients associated with λ1 are chosen from a QPSK constellation rotated by π/4, while the scattering coefficients associated with λ2 are chosen from a QPSK constellation without rotation.

In this work, we use NFDM with information carried only by the discrete spectrum, i.e., by complex eigenvalues λi and their spectral amplitudes, also called scattering coefficients bp(λi), with p = 1, ..., P and i = 1, ..., N_Eig, where P is the number of polarizations and N_Eig the number of eigenvalues. In the time domain, complex eigenvalues and a set of scattering coefficients correspond to a high-order soliton pulse.

In Fig. 2.6, the nonlinear spectrum is illustrated with N_Eig = 2 discrete eigenvalues, with QPSK-modulated scattering coefficients of order M = 4 in P = 2 polarizations. In this example, there are four QPSK constellations. Choosing a single symbol from each constellation corresponds to a unique second-order soliton pulse, leading to M^(P·N_Eig) = 4^(2·2) = 256 different combinations (in the time domain, 256 different waveforms), enabling transmission of 8 bits per soliton pulse. A train of soliton pulses is generated such that they do not overlap in time. During propagation of a trace using the SSFM, the scattering coefficients evolve as follows:

$$b_p(\lambda_i, z) = b_p(\lambda_i, 0) \cdot e^{-4i\lambda_i^2 z}, \tag{2.24}$$

where z is the propagated length. After reception of the amplitude and phase of the solitons, the NFT is used to map the solitons back into the nonlinear spectrum, where the inverse of the linear transformation (2.24) is applied to recover the four QPSK symbols. During propagation of a trace including losses and noise, the evolution in (2.24) no longer holds, and the performance of the system is degraded. The received solitons are distorted and the compensation with the inverse linear transform is non-ideal. The effects of fiber losses and noise on soliton transmission and NFT systems have been studied in [105–111].

2.7 Constellation Shaping

Probabilistic and geometric shaping are techniques that achieve gain by changing the probabilities or the positions of the symbols of the constellation [35, 112]. Under the assumption that the constellation of a transmission system is a continuous Gaussian random variable, a shaping gain of 1.53 dB for the AWGN channel is achieved. The digital nature of the information forces the constellation to be a discrete random variable of finite size. Yet, modeling the symbols Gaussian-like still yields shaping gain. Rearranging the symbols, or making symbols closer to the origin more probable, accomplishes Gaussian-like shapes. Increasing the number of symbols in the constellation gives more freedom to model Gaussian-like constellations and potentially leads to more shaping gain.

Probabilistic shaping under an AWGN channel assumption is discussed in [35, 113] and experimentally demonstrated in [114–116]. Geometric shapes have been proposed in [64, 117–119], with experimental results in [120]. Comparisons between probabilistic and geometric shaping are discussed in [121–123]. A combined shaping approach is proposed in [124]. The constellation shape can be optimized for the MI or the GMI, where the latter is more challenging since the optimization must take the bit-to-symbol mapping into account. Shaping methods optimized for the GMI are discussed in [125–128].

For the fiber optic channel, the achievable shaping gain is unknown due to the channel nonlinearities. Yet, semi-analytical fiber models [27, 30] show that the signal-dependent nonlinear impairment is controlled by stochastic properties of the constellation in use. Geometric and probabilistic constellation shapes with nonlinear tolerance have been proposed [34, 129–133], as well as methods considering channels with memory [134].




CHAPTER 3
Machine Learning Methods

Around 2010, deep learning [135, 136] had its breakthrough after more than a decade of artificial intelligence (AI) winter¹, when the so-called AlexNet [137] won the ImageNet [138] Large Scale Visual Recognition Challenge. AlexNet is based on deep convolutional neural networks and improved the error rate to 15.3%, a margin of 10.8% over the runner-up. The key to its success was a combination of different elaborate techniques.

First, the implementation of the learning algorithm on a graphics processing unit (GPU) architecture increased the training speed and hence the amount of data it could be trained on. Second, non-saturating activation functions [139] improved the training process of the deep architectures, and finally, a stochastic method called Dropout [140] led to reduced overfitting. Thereafter, deep neural networks also produced record-breaking results in machine translation and speech recognition [141–143], which are now part of our everyday life. However, machine learning and neural networks have been around for decades [144–152]. The idea of a perceptron, a main building block of neural networks, dates back to 1943 and 1958 [153–155]. It is inspired by the nervous activity in the human brain.

Other machine learning algorithms exist and are widely used, such as linear models, k-means clustering, Gaussian mixture models, the expectation-maximization algorithm, Markov models, support vector machines and Gaussian processes [149, 156, 157], but they have not received as much attention as deep neural networks [158]. The rapid growth of machine learning in recent years is not only due to record-breaking results, but also to open source software (TensorFlow and PyTorch [13, 159]), open publishing communities, open access research, and a vast amount of online resources, such as online courses [160], blogs of leading scientists [161–166] and even podcasts interviewing leading scientists [167].

¹ https://en.wikipedia.org/wiki/AI_winter

This chapter introduces the machine learning algorithms used in the later chapters, with a focus on neural networks. Section 3.1 provides a general overview of the different machine learning concepts. In Section 3.2, the structure and the training of neural networks are explained; regularization and the role of the activation function are also discussed in this section. Examples of regression and classification using neural networks are studied in Sections 3.3 and 3.4. The principal component analysis (PCA) and the autoencoder (AE) are discussed in Section 3.5 as examples of dimension reduction/feature extraction algorithms. The implications of deep learning for this work, the automatic differentiation algorithm and momentum-aided optimization are discussed in Section 3.6.

3.1 Learning Algorithms

Machine learning is a field of computer science that enables computers to learn from data, or in interaction with a real or virtual environment, through statistical methods. It is commonly categorised into three sub-fields: supervised, unsupervised and reinforcement learning.

3.1.1 Supervised Learning

Supervised learning uses a human-labeled dataset to train an algorithm, which thereafter labels unseen data. The prime example is image classification [137]. Given a dataset of pictures of cats and dogs, a trained algorithm should be able to distinguish a new picture of a cat from one of a dog. For training, the initial dataset is randomly split into three parts: a larger part used for training, the training set, and two smaller parts used for testing and validation, the test and validation sets. The error of the trained algorithm on the test set reflects its performance on unseen data during the training process. A better performance on the training set than on the test set indicates overfitting of the algorithm. The performance on the validation set is used after the training is completed for comparing different algorithms, machine learning model architectures and hyperparameters.
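The random three-way split described above can be sketched as follows (the split fractions and the toy dataset are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def split_dataset(x, t, frac_train=0.8, frac_test=0.1):
    """Randomly split a labeled dataset (x, t) into training, test and
    validation sets; the remainder after train and test goes to validation."""
    idx = rng.permutation(len(x))
    n_tr = int(frac_train * len(x))
    n_te = int(frac_test * len(x))
    tr, te, va = idx[:n_tr], idx[n_tr:n_tr + n_te], idx[n_tr + n_te:]
    return (x[tr], t[tr]), (x[te], t[te]), (x[va], t[va])

x = np.arange(100.0)        # toy inputs
t = (x > 50).astype(int)    # toy labels
train, test, val = split_dataset(x, t)
```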

Learning to put animal pictures into categories is a classification problem. In contrast, if the learning algorithm learns to predict a continuous value, it is solving a regression problem. Sections 3.3 and 3.4 show examples of a regression and a classification problem. In supervised learning, the quality of the dataset is important: since the algorithm is not explicitly programmed, insufficient or falsely labeled data will lead to an incorrect algorithm.

In Chapter 4, we use regression to predict the continuously varying coefficients of a linear combination that approximates second-order moments of a memory effect in the nonlinear fiber channel. In Chapter 5, we classify high-order solitons after transmission.

3.1.2 Unsupervised Learning

Compared to supervised learning, unsupervised learning works with unlabelled data. It aims to find a set of descriptive patterns in the dataset, called features. Thereafter, only the informative features are used, for example by discarding the raw dataset but keeping the features of each datum, resulting in a new dataset with reduced redundancy.

With the earlier example of a dataset with pictures of cats and dogs, an unsupervised learning algorithm finds that there are two different categories represented in the dataset. It recognises that the pictures of dogs have something in common, and that there is a contrast to the pictures of cats. This is also referred to as clustering [168]: cluster A for dogs and cluster B for cats. If the learning algorithm is repeated, the result could be vice versa, cluster A for cats and cluster B for dogs. Hence, unsupervised learning works within the data without a reference given from the outside, whereas supervised learning works from the outside on the data with human input as reference.

A clustering procedure can be used for anomaly/outlier detection [169]. Suppose the dataset of cats and dogs unknowingly includes an additional batch of pictures of llamas. Then the unsupervised learning algorithm should detect three clusters instead of two: A, B and C for cats, dogs and llamas, which makes the anomaly visible.

Another application of unsupervised learning is face recognition [170]. Before an algorithm compares two human faces, their pictures have to be translated into a domain in which they are mathematically comparable. A good face recognition system has learned, from a dataset of human faces, to extract features that preserve the most distinguishable patterns of a human face. In this way, the system translates every human face into a vector representation of these most important features. Faces are then distinguished by comparing their feature vectors. Two feature vectors of different pictures of the same person will be close to each other in the feature space, given a metric such as the Euclidean distance.

Dimension reduction algorithms such as the PCA and the AE are explained in Sections 3.5.1 and 3.5.2. In Chapter 4, principal components are extracted from a dataset of second-order moments of a memory effect in the nonlinear fiber channel, and thereafter a dimensionality-reduced representation is used for prediction. In Chapter 6, an AE is used to learn a geometric constellation shape.

3.1.3 Reinforcement Learning

Reinforcement learning describes a method that learns by rewarding itself for actions that favour a desired outcome. Most famously, the research team of Google DeepMind trained their algorithm AlphaGo Zero [171] to learn the game of Go. It taught itself through self-play, reinforcing/rewarding moves which led to victory. Reinforcement learning is based on a Markov decision process [172], a mathematical model for decision making. There are groups exploring reinforcement learning in the field of telecommunication [173]; however, it is not used in this work.

3.2 Neural Network

Neural networks [149] are built of many nodes. A node has a number of inputs, each of which is multiplied with a different weight. The linear combination of the weighted inputs determines the firing strength of the node. The firing rule is defined by the nonlinear activation function. Multiple nodes in parallel, a layer, have equal inputs but a different set of weights; hence they fire independently given the same input values. Multiple layers in series, where the outputs of the previous layer serve as inputs for the next layer, lead to more complex, deep neural networks [158].

This section explains the structure of neural networks, how to train andregularize them, and their activation functions.



Figure 3.1: (a) A single neuron with inputs x1, x2, weights w1, w2 and output y1, and (b) a logistic sigmoid activation function.

3.2.1 Structure

A Single Neuron

A single neuron is shown in Fig. 3.1 (a). It has two inputs (x1 and x2), two weights (w1 and w2), a logistic sigmoid activation function σ(·) and an output y1. The activation function is depicted in Fig. 3.1 (b). The output is connected to the inputs as follows:

$$y_1 = \sigma\!\left(\sum_{i=1}^{2} w_i x_i\right), \tag{3.1a}$$
$$\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{3.1b}$$

The firing strength of the output is clearly dependent on the weights, also collectively called the parameters, θ = {w_{j,i}^{(k)}}. Fig. 3.2 shows a schematic plot of the weight-space of (3.1a), where each point represents a single realization of the neuron. For different positions in the weight-space, the firing rule of the neuron changes. Hence, the single neuron can learn a wide range of functions.

A Neural Network with a Single Hidden Layer

A neural network built of several neurons is shown in Fig. 3.3 and given by:



Figure 3.2: This figure visualises the weight-space of a single neuron, equation (3.1a). Each point in the weight-space, w = (w1, w2), represents a single realisation of the neuron, a 2-D logistic sigmoid function. This figure is a replication of a figure in [174].



$$h_i = \sigma\!\left(\sum_{j=1}^{2} w_{i,j}^{(1)} x_j + w_{i,0}^{(1)}\right), \tag{3.2a}$$
$$y_i = \sum_{j=1}^{5} w_{i,j}^{(2)} h_j + w_{i,0}^{(2)}, \tag{3.2b}$$

where the nodes h_i are called hidden nodes and collectively the hidden layer. This neural network has two input nodes x_i, one hidden layer with five hidden nodes h_i, and an output layer of two nodes y_i. The weights w_{i,j}^{(1)} connect the input layer with the hidden layer and the weights w_{i,j}^{(2)} connect the hidden layer with the output layer. The weights w_{i,0}^{(1)} and w_{i,0}^{(2)} are the weights of the biases, which determine an offset. The output layer omits the activation function and is a linear combination of the weighted hidden nodes.

The neural network can represent a much wider range of functions than a single neuron, but it also contains vastly more free parameters. A neural network can represent arbitrarily complex functions [145] by increasing the number of hidden nodes, which is equivalent to increasing the number of free parameters. The number of free parameters determines its computational complexity, and with an increasing number of parameters, a neural network becomes prone to overfitting. The weight-space picture from the single neuron, depicted in Fig. 3.2, does not change. Also for the neural network, a point in its weight-space, i.e. a set of fixed weights, represents one possible realization, and the whole weight-space represents all its possible realizations.


Figure 3.3: A neural network with an input, hidden and output layer holding two, five and two nodes, respectively.


A mathematically more compact representation of a neural network is given as follows:

y = f_θ(x),  (3.3)

where y is a vector of the output nodes, x is a vector of the input nodes, θ is the parameter vector and f_θ(·) represents the mapping of the neural network given θ. Finding a point in its weight-space which leads to a neural network with a desired function is called training. The next section explains the training procedure, which is nothing else but a smart search in the weight-space.

3.2.2 Training

For a supervised learning task, a neural network is trained with a labeled dataset: a set of input vectors x^(n), n = 1, ..., N, and their corresponding outcomes t^(n), a set of vectors called targets. The samples of the training set are used to adjust the weights of the neural network, such that the output vector becomes similar to the target vector for every sample of the training set,

f_θ(x^(n)) = y^(n) ≈ t^(n).  (3.4)

The first step towards that goal is to define a loss function. The loss is application specific and should yield an error if (3.4) does not hold. For classification problems, the cross-entropy loss is typically chosen (3.5a). For regression problems, the mean squared error (MSE) is often appropriate (3.5b).

L_cross-entropy(θ) = (1/N) Σ_{n=1}^{N} [ − Σ_{d=1}^{D} t^(n)_d log( y^(n)_d ) ],  (3.5a)

L_MSE(θ) = (1/N) Σ_{n=1}^{N} [ (1/(2D)) Σ_{d=1}^{D} | t^(n)_d − y^(n)_d |² ],  (3.5b)

where N is the number of samples in the training set and D is the number of output nodes. In both losses, the bracketed term is the error of a single sample and the outer average over the N samples forms the expectation. For the cross-entropy loss function, the output vector of the neural network should always represent a probability vector; this is further discussed in Appendix A. By minimizing the loss function, given the training


data, the neural network is trained. This is typically achieved by gradient descent optimization. The gradient of the loss function w.r.t. the weights of the neural network is computed. With gradient information in the weight-space, the current set of weights is updated such that the loss is minimized:

θ(l+1) = θ(l) − η∇θL(θ(l)), (3.6)

where η > 0 is the learning rate, l is the training step iteration and ∇_θL(·) is the gradient of the loss function. The gradient of a neural network is efficiently computed with the backpropagation algorithm [149, 158], which applies the differentiation chain rule to the neural network w.r.t. the weights.
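A minimal sketch of the MSE loss (3.5b) and the gradient descent update (3.6) for a one-parameter toy model y = θx, used here as a stand-in for the neural network (not from the thesis):

```python
def mse_loss(targets, outputs):
    """Mean squared error, equation (3.5b), for scalar outputs (D = 1)."""
    n = len(targets)
    return sum((t - y) ** 2 for t, y in zip(targets, outputs)) / (2 * n)

def gradient_step(theta, xs, ts, eta):
    """One update theta <- theta - eta * dL/dtheta for the model y = theta * x."""
    n = len(xs)
    grad = sum((theta * x - t) * x for x, t in zip(xs, ts)) / n
    return theta - eta * grad

xs = [1.0, 2.0, 3.0]
ts = [2.0, 4.0, 6.0]          # generated by theta_true = 2
theta = 0.0
for _ in range(200):
    theta = gradient_step(theta, xs, ts, eta=0.05)
```

Each iteration moves θ against the gradient of the loss; after enough steps, θ converges towards the value that generated the targets.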

When the whole training set is used for each update step, the process is called batch gradient descent, since there is only one data batch. However, the training set is often split into multiple batches, such that the loss function averages over a subset of the training set, N_Batch < N. The whole training set is still used, but the first update step is computed with the first batch, the second update step with the second batch, and so on. The first batch is used again after all batches have been used. The number of times the neural network has been trained with all batches is referred to as epochs. This update procedure is commonly referred to as stochastic gradient descent (SGD) [158, 175], since each update uses a smaller batch and therefore obtains an estimate of the gradient rather than the true gradient. Besides adding some randomness to the training, SGD also eases the memory demands of the gradient descent optimization, due to the smaller batches.
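The batching scheme can be sketched as follows; the model and its gradient function are hypothetical placeholders, again the toy model y = θx:

```python
import random

def sgd(theta, data, grad_fn, eta=0.05, batch_size=2, epochs=50):
    """Mini-batch SGD: one update per batch, one epoch per full pass."""
    for _ in range(epochs):
        random.shuffle(data)  # common practice; the batch order is randomized
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient estimated on the batch only (N_Batch < N).
            grad = sum(grad_fn(theta, x, t) for x, t in batch) / len(batch)
            theta -= eta * grad
    return theta

# Toy model y = theta * x with MSE gradient (placeholder example).
grad_fn = lambda theta, x, t: (theta * x - t) * x
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # theta_true = 3
theta = sgd(0.0, data, grad_fn)
```

Every batch yields a noisy gradient estimate, yet the parameter still converges because all batches share the same underlying relationship.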

After several iterations with either update procedure, the weights adjust such that, at best, the neural network learns the relationships present in the training set. However, the optimization is non-convex [176, 177] and, as a general rule, converges to a local minimum. Whether the trained neural network performs sufficiently well is determined with tests and validations.

3.2.3 Test and Validation

The dataset is split into three parts: the training, test and validation set. The test and validation sets are not used during the training process for the minimization in (3.6), but the loss of the test set is still computed at each iteration. It is used to estimate the performance of the neural network on unseen data. If the loss of the test set increases while the loss of the training set decreases, the neural network is overfitting on the training set. Regularization methods


combat overfitting and are discussed in Section 3.2.5. The loss of the validation set is computed after training is completed, for comparing separately trained neural network architectures and different hyperparameters. Grid and manual search are typically used for hyperparameter optimization [178, 179].

3.2.4 Activation Functions

The activation function adds nonlinearity to the neural network, allowing it to learn nonlinear relationships from the dataset [174]. Two commonly used activation functions are the logistic sigmoid and the hyperbolic tangent, as depicted in Fig. 3.4.


Figure 3.4: Logistic sigmoid and hyperbolic tangent activation functions.

3.2.5 Regularization

Overfitting describes a situation where the learning algorithm fits the training set too well, to a degree that the noise within the training data is reflected in the prediction. Methods countering overfitting are called regularization. One form of regularization adds a term to the loss function which penalises large weights in the neural network. The sum of the loss function and a regularization term is called the cost function [149],

C(θ) = L(θ) + βR(θ),  (3.7a)

R(θ) = Σ_{k,i,j} | w^(k)_{j,i} |²,  (3.7b)

where β is a hyperparameter controlling the importance of the regularization term and k is the layer number. Instead of minimizing the loss L(·), now the



Figure 3.5: Logistic sigmoid activation function with varying input scaling.

cost C(·) is minimized. The regularization term R(·) acts on the parameters of the neural network, and this particular regularization is called L2 regularization. By forcing the neural network to jointly minimize the loss and keep the weights small, it is bound to ignore noisy patterns in the training data.
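A minimal sketch of the cost function (3.7) with an L2 penalty; β and the weight values are arbitrary illustrative numbers:

```python
def l2_penalty(weights):
    """R(theta) in (3.7b): sum of squared weights over all layers."""
    return sum(w ** 2 for layer in weights for row in layer for w in row)

def cost(loss, weights, beta):
    """C(theta) = L(theta) + beta * R(theta), equation (3.7a)."""
    return loss + beta * l2_penalty(weights)

# Two layers of weights (arbitrary values for illustration).
weights = [
    [[0.5, -1.0], [2.0, 0.0]],   # layer 1
    [[1.0, 1.0]],                # layer 2
]
c = cost(loss=0.25, weights=weights, beta=0.01)
```

The penalty grows quadratically with the weight magnitudes, so large weights are discouraged in proportion to β.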

Fig. 3.5 shows a logistic sigmoid function, σ(ax), for a varying parameter a. As indicated, for a → ∞ the logistic sigmoid function (3.1b) becomes the unit step function. Considering a neural network, which is comprised of many logistic sigmoid functions, large weights lead to abrupt rather than smooth transitions in the overall learned function. When overfitting occurs, the neural network models the noise within the training set. To fit noise patterns, the neural network requires large weights modeling step-like transitions. Hence, small weight values keep the neural network from overfitting.
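The step-like behaviour of σ(ax) for large a can be checked numerically:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# For a small scaling a, the transition around x = 0 is smooth;
# for a large a, sigma(a*x) approaches the unit step function.
smooth = sigmoid(1.0 * 0.5)    # moderate value, far from saturation
steep = sigmoid(100.0 * 0.5)   # effectively a step: saturated at 1
```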

3.3 Regression

This section discusses a regression problem addressed with supervised learning. As an example, a neural network is used to fit a sine function. The dataset consists of noisy samples of a sine function. Fig. 3.6 shows the sine function, with 50 training data points as orange crosses and 5 testing data points depicted as green circles. A neural network with one input node, one hidden layer of 32 hidden nodes and one output node is trained to learn from the training data points. Fig. 3.7 (a) shows the convergence of the MSE loss function of the training and test set with respect to the training iteration. Fig. 3.7 (b), (c), (d), (e) and (f) show the learned prediction at training iterations 200, 400, 600, 800 and 1800, respectively. After 600 training iterations the neural network starts to overfit. It is detected with the loss of the test set:



Figure 3.6: Sine function with noisy training and testing samples.

in Fig. 3.7 (a), both losses initially decrease, but around iteration 600 the loss of the test set starts increasing, whereas the loss of the training set keeps decreasing. Since both datasets are sampled from the same function and hold the same pattern, after iteration 600 the learning algorithm must be learning a pattern that is only inherent in the training dataset, that is, the noise signal of the training set. At iterations 200-600, in Fig. 3.7 (c) and (d), the learned functions resemble smooth, almost sine-like curves, whereas at iterations 800-1800, the learned functions start to show more abrupt bends, which reflect the overfitting of the learning algorithm, annotated by the arrows in Fig. 3.7 (f). Regularization methods, discussed in Section 3.2.5, counter overfitting.
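The overfitting detection described here, a test loss that rises while the training loss still falls, can be sketched as a simple early-stopping check; the loss curves below are illustrative values, not the ones from Fig. 3.7:

```python
def overfit_iteration(train_loss, test_loss):
    """Return the first index where the test loss starts increasing
    while the training loss is still decreasing, or None."""
    for i in range(1, len(test_loss)):
        if test_loss[i] > test_loss[i - 1] and train_loss[i] < train_loss[i - 1]:
            return i
    return None

# Illustrative curves: both fall at first, then the test loss diverges.
train = [0.60, 0.40, 0.25, 0.15, 0.10, 0.07]
test = [0.62, 0.45, 0.30, 0.22, 0.26, 0.31]
idx = overfit_iteration(train, test)  # 4 in this toy example
```

Stopping the training at this index is a common regularization heuristic (early stopping).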

3.4 Classification

This section discusses a classification problem addressed with supervised learning. As an example, a neural network is used to classify and detect a quadrature phase-shift keying (QPSK) signal after transmission over an additive white Gaussian noise (AWGN) channel with a signal-to-noise ratio (SNR) of 8 dB. For the training set, 1000 symbols are uniformly sampled from the QPSK constellation and transmitted through the channel. The dataset consists of the received observations and the corresponding labels, which associate each observation to one of the four classes of the respective QPSK symbol. The labels are one-hot encoded vectors, as discussed in Appendix A. The same process is repeated for obtaining the test set. Fig. 3.8 (a) shows the training set observations, with colors/markers indicating the four distinct classes.

A neural network with two input nodes (in-phase (I)/quadrature (Q) components), one layer of 32 hidden nodes and four output nodes (one-hot encoded vector) is trained to learn from the training data points. Fig. 3.8 (b) shows



Figure 3.7: (a) Loss function of the training and test sets with respect to the training iteration, and the neural network prediction after (b) 200, (c) 400, (d) 600, (e) 800, (f) 1800 training iterations. The arrows in (f) indicate overfitting.



Figure 3.8: (a) Received observations of a QPSK signal with colors/markers indicating the four distinct classes of the respective QPSK symbol. (b) Loss function of the training and test sets with respect to the training iteration.

the convergence of the cross-entropy loss function, for both the training set and the test set, with respect to the training iteration. After training, the neural network is able to decide on future observations with unknown labels. Fig. 3.9 shows the decision regions learned by the neural network. For an equiprobable QPSK modulation format and the AWGN channel, the optimal decision boundaries are known to be the x-axis and y-axis, such that a minimum distance receiver is optimal [63]. The neural network has not learned the optimal decision boundaries, because it classifies a region in the first quadrant (blue) to the second quadrant (green), as shown in Fig. 3.9. The problem is that the training data samples only yield an estimate of the real probability distribution of the AWGN channel. Hence, the neural network, only exposed to samples, learns that some of the green/upper left region must be in the first quadrant. Adding regularization and increasing the size of the training set will both improve the decision boundaries. However, learning the exact optimal decision boundaries with a neural network is challenging. The same problem is encountered in Chapter 5, where the learned decision boundaries of a soliton transmission are only close to optimal.
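The dataset generation for this example — QPSK symbols over an AWGN channel with one-hot labels — can be sketched as follows; the symbol mapping and noise scaling are illustrative assumptions, not the thesis's exact simulation code:

```python
import math
import random

def qpsk_awgn_dataset(n_symbols, snr_db, seed=0):
    """Received QPSK observations over AWGN with one-hot labels."""
    rng = random.Random(seed)
    # Unit-energy QPSK constellation, one class per symbol.
    constellation = [(s / math.sqrt(2), t / math.sqrt(2))
                     for s in (1, -1) for t in (1, -1)]
    sigma = math.sqrt(10 ** (-snr_db / 10) / 2)  # noise std per I/Q component
    observations, labels = [], []
    for _ in range(n_symbols):
        cls = rng.randrange(4)
        i, q = constellation[cls]
        observations.append((i + rng.gauss(0, sigma), q + rng.gauss(0, sigma)))
        labels.append([1.0 if d == cls else 0.0 for d in range(4)])
    return observations, labels

obs, labels = qpsk_awgn_dataset(1000, snr_db=8)
```

The observations feed the two input nodes (I/Q) and the one-hot labels serve as the four-dimensional targets for the cross-entropy loss (3.5a).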



Figure 3.9: (a) Decision regions of the trained neural network and (b) a zoomed version of (a).

3.5 Dimension Reduction and Feature Extraction

This section explains the PCA and the AE.

3.5.1 Principal Component Analysis

Theory

The PCA is a linear dimension reduction algorithm [149, 158]. Given is a mean-free (zero-mean) dataset x^(n), n = 1, ..., N, where each sample is a column vector of dimension D. Concatenating all samples in a matrix,

X = [ x^(1), x^(2), ..., x^(N) ] ∈ R^{D×N},  (3.8)

allows defining the eigendecomposition of the covariance matrix C of the dataset:

C = E[ (X − E[X]) (X − E[X])^T ] = E[ X X^T ] ∈ R^{D×D},  (3.9a)

C p_n = λ_n p_n,  (3.9b)


where the eigenvalue λ_n represents the variance of the dataset in the direction of the respective eigenvector p_n. In (3.9a), we use that the dataset has zero mean. The eigenvectors are called the principal components of the dataset, are ordered such that λ_1 > λ_2 > ... > λ_D, and are orthogonal since C is symmetric. A dataset with equivalent redundancy, y^(n), is obtained with a linear change of basis:

X = P Y,  (3.10a)

P = [ p_1, p_2, ..., p_D ] ∈ R^{D×D},  (3.10b)

Y = [ y^(1), y^(2), ..., y^(N) ] ∈ R^{D×N},  (3.10c)

where the principal components act as the new basis. An approximate reconstruction of X with the first D̃ principal components is given by:

X̃ = P̃ Ỹ,  (3.11a)

P̃ = [ p_1, p_2, ..., p_D̃ ] ∈ R^{D×D̃},  (3.11b)

Ỹ = [ ỹ^(1), ỹ^(2), ..., ỹ^(N) ] ∈ R^{D̃×N},  (3.11c)

where D̃ < D and ỹ^(n) is a new dataset with reduced dimension, where each sample is of dimension D̃.

For every D̃, the optimal linear reconstruction is achieved, since the principal components are orthogonal and ordered with decreasing variance. Given D̃, the average reconstruction accuracy is:

r(D̃) = ( Σ_{d=1}^{D̃} λ_d ) / ( Σ_{d=1}^{D} λ_d ).  (3.12)

The eigenvalues are often called spectra of the PCA, since they reflect theenergy of the signal (the dataset) held by their respective principal component,similar to the Fourier transform, where the spectral amplitudes reflect theenergy held by their respective sine and cosine basis.
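For a two-dimensional dataset, the eigendecomposition in (3.9) and the accuracy in (3.12) can be computed in closed form; the sample values below are illustrative:

```python
import math

def pca_2d(samples):
    """Eigenvalues of the 2x2 covariance of zero-mean 2-D samples,
    ordered with decreasing variance, as in (3.9)."""
    n = len(samples)
    # Covariance entries (dataset assumed zero-mean, cf. (3.9a)).
    cxx = sum(x * x for x, _ in samples) / n
    cyy = sum(y * y for _, y in samples) / n
    cxy = sum(x * y for x, y in samples) / n
    # Closed-form eigenvalues of [[cxx, cxy], [cxy, cyy]].
    mean = (cxx + cyy) / 2
    delta = math.sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
    return mean + delta, mean - delta   # lambda_1 >= lambda_2

# Strongly correlated toy data: most variance lies along one direction.
samples = [(1.0, 0.9), (-1.0, -1.1), (0.5, 0.4), (-0.5, -0.2)]
lam1, lam2 = pca_2d(samples)
r1 = lam1 / (lam1 + lam2)   # reconstruction accuracy r(1), equation (3.12)
```

Because the toy samples are strongly correlated, a single principal component already captures almost all of the variance, which is exactly the situation exploited in Chapter 4.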

Example

The matched filter for a pulse-shaped transmission can be estimated using the PCA. Given is a bandlimited communication system, transmitting symbols



Figure 3.10: Trace of 5 pulses (a) before and (b) after transmission. (c) 32 aligned noisy pulses after transmission.

of a BPSK constellation at a symbol rate of 10 GBd with a root raised cosine pulse-shaping filter over an AWGN channel with an SNR of 32 dB within the signal bandwidth. The system has one limitation: the pulses must be non-overlapping and separated in time. Fig. 3.10 (a) and (b) show a train of pulses before and after transmission. The received pulses must be matched filtered and downsampled to one sample per symbol (one sample per pulse) for optimal detection. In essence, the matched filtering and downsampling are performing a dimension reduction. For a root raised cosine pulse-shape, the



Figure 3.11: (a) PCA spectra. (b) Matched filter and first two principal components. (c) Distorted BPSK symbols extracted by the PCA.

optimal matched filter is again a root raised cosine pulse-shaping filter, whichcan also be obtained by applying the PCA.

For this, the received train of pulses is aligned, as shown in Fig. 3.10 (c), and used as the dataset for the PCA. The mean of the dataset is subtracted from the samples and the samples are put into the columns of a matrix, as described in (3.8). Thereafter, the eigendecomposition is applied for the PCA.

The PCA spectra, as shown in Fig. 3.11 (a), have one principal component holding more than 75% of the variance of the dataset. The first and


second principal components are depicted in Fig. 3.11 (b), together with the root raised cosine pulse-shaping filter. The first principal component is perfectly aligned with the real matched filter. The second and all other principal components exclusively model noise from the channel. The first row of the Ỹ matrix in (3.11c) holds the coefficients of the first principal component. Here, they represent the transmitted BPSK symbols with values around ±1, see Fig. 3.11 (c). Hence, the PCA found the same dimension reduction as the combined matched filter and downsampling, and extracted the pulse-shape from the received signal.

3.5.2 Autoencoder

The AE is a nonlinear dimension reduction algorithm [158, 180]. It is composed of two neural networks, an encoder and a decoder, arranged in sequence, see Fig. 3.12. The aim is to reproduce the input of the encoder, x, at the output of the decoder, y. The output of the encoder, h, the latent space, is of lower dimension than the input and output of the AE. The encoder must learn a representation of the data that holds enough information for replication through the decoder.

The AE is trained similarly to a single neural network, with the difference that the input data is also used as the targets. After training, the encoder has learned a non-redundant feature extraction/dimension reduction method. Since neural networks include nonlinearities, the AE can learn nonlinear relationships, in contrast to the PCA, which only extracts linearly independent features.
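The 4-2-4 structure of Fig. 3.12 can be sketched as follows; the weights are arbitrary placeholders, whereas a real AE would learn them by minimizing the reconstruction loss between x and y:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, W, b, activation):
    return [activation(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def autoencoder(x, enc_W, enc_b, dec_W, dec_b):
    """Encoder maps the 4-D input to a 2-D latent space h; the decoder maps back."""
    h = layer(x, enc_W, enc_b, sigmoid)          # latent representation
    y = layer(h, dec_W, dec_b, lambda z: z)      # linear output, y should approximate x
    return h, y

# Placeholder weights: 2x4 encoder, 4x2 decoder.
enc_W = [[0.2, -0.1, 0.4, 0.3], [-0.3, 0.5, 0.1, -0.2]]
enc_b = [0.0, 0.1]
dec_W = [[0.6, -0.2], [0.1, 0.7], [-0.4, 0.3], [0.2, 0.2]]
dec_b = [0.0, 0.0, 0.0, 0.0]

h, y = autoencoder([1.0, 0.5, -0.5, 0.2], enc_W, enc_b, dec_W, dec_b)
```

The bottleneck h forces the encoder to compress: any information that does not fit into the two latent nodes is lost for the reconstruction.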


Figure 3.12: An AE composed of two neural networks, encoder and decoder, which share a layer.


By embedding a channel model between the encoder and the decoder, the AE can learn codes and modulation formats in its latent space [181]. In Chapter 6, a fiber channel model is combined with an AE to learn geometric constellation shapes jointly robust to amplification noise and nonlinear effects of the fiber-optic channel.

3.6 Deep Learning

When the term deep learning is used, typically convolutional neural networks or recurrent neural networks with multiple layers are assumed. However, many concepts attributed to deep learning are not necessarily connected to deep architectures, but facilitate their training and improve their performance. For the applications in this work, neural networks of no more than 3 layers are used, yet many techniques developed for deep architectures have been applied.

This section introduces further methods which are attributed to deep learning and used in later chapters.

3.6.1 Rectified Linear Unit

The rectified linear unit (ReLU) is given by [139, 142, 158, 182]:

ReLU(x) = { 0 for x < 0,
            x for x ≥ 0.  (3.13)

The rise of the ReLU function is due to the vanishing gradients problem [183–185], caused by logistic sigmoid and hyperbolic tangent activation functions during training in deep architectures. During the backpropagation algorithm, the gradients of the activation function for each node are passed on to the previous layer and multiplied with the gradients there. In deep architectures this process is repeated several times and the gradients accumulate multiplicatively. The final product is exclusively comprised of values smaller than one, since the gradients of the logistic sigmoid and hyperbolic tangent functions are strictly smaller than one. The training of the neural network becomes impractically slow due to a very small final gradient. In contrast, the ReLU function only has two gradients, zero and one, such that during backpropagation the gradients are either on or off. In this way, the ReLU function prevents vanishing gradients and facilitates the training of deep neural networks. Further, the computation is cheaper, as there is no exponential function in the ReLU function.
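The effect can be checked numerically: the sigmoid derivative is at most 0.25, so a product over many layers shrinks quickly, while active ReLU units pass a gradient of exactly one. This is a simplified sketch that ignores the weight factors of a real backward pass:

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)          # maximum 0.25, attained at z = 0

def relu_grad(z):
    return 1.0 if z >= 0 else 0.0

depth = 20
# Best case for the sigmoid: every pre-activation sits at z = 0.
sig_product = 1.0
relu_product = 1.0
for _ in range(depth):
    sig_product *= sigmoid_grad(0.0)   # 0.25 per layer
    relu_product *= relu_grad(1.0)     # 1.0 per active unit

# sig_product = 0.25**20, roughly 9e-13: the gradient has vanished.
```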


3.6.2 Momentum

Advanced optimization algorithms include a momentum term in the gradient update step of (3.6). It adds a fraction of the previous update vector to the most recent gradient, introducing a memory effect. Mathematically, it is described as follows [186]:

v^(l) = α_M v^(l−1) + η∇_θL(θ^(l)),  (3.14a)

θ^(l+1) = θ^(l) − v^(l),  (3.14b)

where v^(l) is the update vector at iteration l, η is the learning rate and α_M is a hyperparameter defining the fraction of memory to account for. Momentum makes the training converge faster and helps to escape small local minima.
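A minimal sketch of the update rule (3.14) on a one-dimensional quadratic loss L(θ) = (θ − 3)²/2; the loss and the hyperparameter values are illustrative:

```python
def momentum_updates(theta, grad_fn, eta=0.1, alpha_m=0.9, steps=300):
    """Gradient descent with momentum, equations (3.14a)-(3.14b)."""
    v = 0.0
    for _ in range(steps):
        v = alpha_m * v + eta * grad_fn(theta)   # (3.14a): memory of past updates
        theta = theta - v                        # (3.14b)
    return theta

# L(theta) = (theta - 3)**2 / 2, so dL/dtheta = theta - 3.
theta = momentum_updates(0.0, lambda t: t - 3.0)
```

With α_M = 0 the loop reduces to plain gradient descent (3.6); a nonzero α_M lets consecutive updates in the same direction build up speed.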

More elaborate optimization algorithms including momentum, such as the Adam optimization algorithm [187], are already implemented in TensorFlow [13] and PyTorch [159].

3.6.3 Automatic differentiation

TensorFlow [13] and PyTorch [159] include a function called automatic differentiation [188, 189], which is a numerical method to accurately compute derivatives of numeric functions. It is similar to the backpropagation algorithm [149, 158] for differentiating neural networks, but more general. Mathematical models are described by a sequence of elementary operations, such as addition, subtraction, multiplication and division, in combination with elementary functions, such as sin(·), cos(·), exp(·) and log(·). Automatic differentiation applies the chain rule repeatedly to such a sequence of operations and functions w.r.t. a set of parameters, such that it allows determining the gradient of arbitrarily deep mathematical structures. The obtained gradients are then used for optimization.

In Chapter 6, we use the automatic differentiation algorithm to calculatethe gradient of a channel model, which makes end-to-end learning possible.
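The chain-rule mechanics can be illustrated with forward-mode automatic differentiation using dual numbers. This is a self-contained sketch; TensorFlow and PyTorch use reverse mode, which is closer to backpropagation:

```python
import math

class Dual:
    """Dual number (value, derivative): arithmetic carries the chain rule."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dsin(x):
    # Chain rule for the elementary function sin(.).
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x):
    # f(x) = x*x + sin(x), so f'(x) = 2x + cos(x).
    return x * x + dsin(x)

x = Dual(1.5, 1.0)       # seed derivative dx/dx = 1
y = f(x)                 # y.dot holds f'(1.5), obtained mechanically
```

Every elementary operation propagates both the value and its derivative, so the final y carries the exact gradient without any symbolic manipulation.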


CHAPTER 4
Prediction of Fiber Channel Properties

This chapter includes results presented in the conference contribution [C6].

4.1 Introduction

Analytical models of communication systems assess their performance without cumbersome numerical simulations. In particular, fully loaded wavelength-division multiplexing (WDM) systems are almost impossible to simulate with the split-step Fourier method (SSFM). Many mathematical models of WDM systems have been proposed and some are discussed in Section 2.3. These semi-analytical models predict the system performance very accurately. However, they require computing a set of non-analytical properties, which must be obtained through computationally expensive numerical integration. Thereafter, the model is analytical, as long as these properties are assumed constant. When the system and the physical parameters change, this assumption often breaks and the properties must be recalculated. A full assessment of the system is only possible if the performance metric is independent of the described properties. Since these properties are a function of the physical layer parameters, their relationship can be learned using supervised learning. Thereafter, the computationally expensive numerical integration is skipped and the properties are instead predicted by the learning algorithm. The supervised learning algorithm is not limited to learning single parameters, but is capable of learning multiple parameters at the same time. Further, functional properties can be learned, such as auto-correlation functions or spectra.

In this chapter, a supervised learning algorithm is trained to predict auto-correlation functions obtained from the fiber model described in Section 2.3.2. They describe the auto-correlation of one single term of the summation in (2.15),


i.e., the stochastic process of one interfering channel. It can be used to emulate the cross-phase modulation (XPM) process from white Gaussian noise sources, instead of generating the interfering waveforms. Similar auto-correlation functions are used in [29, 190, 191] for emulation of the optical fiber channel including memory effects, and in [32] for a nonlinear interference (NLI) mitigation algorithm.

4.2 Machine Learning Framework

The machine learning framework, shown in Fig. 4.1, is trained to predict the auto-correlation functions generated from the fiber model described in Section 2.3.2. First, a dataset of multiple auto-correlation functions is generated by simulations of the fiber channel model, and thereafter the nonlinear mapping from the physical layer parameters to the generated auto-correlation function is learned. However, the output dimension of the neural network must be as large as the length of the auto-correlation function. With the entire auto-correlation function as output vector, the neural network is overly complex, with a large number of trainable weights. Instead, it is favourable to reduce the number of output units of the neural network by transforming the auto-correlation functions such that fewer variables carry most of the information.

This is achieved with the principal component analysis (PCA), which performs a change of basis on the auto-correlation functions, as described in Section 3.5.1. The principal components are chosen such that few of them cover most of the variance of the auto-correlation functions. A linear combination of the chosen principal components approximates the auto-correlation functions. Although information is lost by discarding insignificant principal components, the dimensionality of the dataset is reduced. Recovering the auto-correlation functions with a subset of D̃ = 3 principal components is given by:

R̃^(n)[k] = Σ_{d=1}^{D̃} c^(n)_d p_d[k],  (4.1)

where R̃^(n)[k] is the n-th auto-correlation function of the dataset with symbol delay k, p_d[k] is the d-th principal component and c^(n)_d are the coefficients of the PCA. Each auto-correlation function is now described by three coefficients c^(n)_1, c^(n)_2, c^(n)_3.


[Block diagram: the fiber channel model generates a dataset of auto-correlation functions, cf. equation (3.8); the PCA splits it into coefficients and principal components, cf. Fig. 4.2 (a); the neural network is trained on the coefficients and predicts them from the physical layer parameters.]

Figure 4.1: Machine learning framework.

The principal components, which represent pseudo auto-correlation functions, are shown in Fig. 4.2 (a). Physical layer parameters, such as launch power, channel spacing, span length and propagated distance, are used as input to the neural network, which predicts the coefficients of the principal components, such that the number of neural network output nodes is reduced to 3. Fig. 4.2 (b) shows an example of a true auto-correlation function obtained by the fiber channel model and the prediction by the trained machine learning framework.

Even though the most important principal components are chosen, the other principal components also hold information about the auto-correlation functions. This information is lost during the dimension reduction and cannot be recovered by the neural network. Hence, there are two error sources: the dimensionality reduction itself and an insufficiently trained neural network.



Figure 4.2: (a) Three principal components whose linear combination recovers all auto-correlation functions with an accuracy of 99.3%. An accuracy of 88.5% is obtained by only using the first principal component. (b) Auto-correlation function as a function of symbol delay, obtained from the model and the machine learning framework. The latter is a linear combination of the three principal components in (a).

Table 4.1: Range of physical layer parameters for the generated auto-correlation functions.

Parameter                            Min.     Max.
Interfering channel launch power P   -2 dBm   4.5 dBm
Interfering channel spacing Ω        35 GHz   350 GHz
Span length LSp                      40 km    120 km
Propagated distance z′               0 km     2000 km

4.3 Training and Prediction

The PCA was performed on a dataset of 3000 auto-correlation functions of XPM-induced phase-noise. Differently shaped auto-correlation functions are obtained by changing the four physical layer parameters within the ranges given in Table 4.1. With the PCA, the dimensionality of the auto-correlation functions is reduced to 3. The three retained components hold 99.3% of the variance of the dataset.
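The PCA step can be sketched as follows, assuming a dataset matrix with one auto-correlation function per row (random stand-ins here, so the explained variance will not reproduce the 99.3% of the actual dataset):

```python
import numpy as np

# Random stand-ins for the 3000 simulated auto-correlation functions
rng = np.random.default_rng(1)
X = rng.normal(size=(3000, 101))          # one ACF per row, 101 symbol delays
Xc = X - X.mean(axis=0)                   # center the dataset
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:3]                       # first 3 principal components
coeffs = Xc @ components.T                # 3 coefficients per ACF
explained = (S[:3] ** 2).sum() / (S ** 2).sum()  # variance held by the 3 PCs
```

Each row of `coeffs` is the 3-dimensional representation that the neural network later learns to predict from the physical layer parameters.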

An auto-correlation function describes an XPM process, generated by an interfering channel within a span of length LSp, starting z′ into the link, with channel spacing Ω from the channel of interest (COI) and a launch power of PTx. Other parameters, such as the modulation format, sample frequency,


chromatic dispersion coefficient, nonlinear coefficient, attenuation and roll-off factor are set to 16 QAM, 32 GHz, 17 ps/(km nm), 1.2 (W km)−1, 0.2 dB/km and 0.1, respectively. Regarding a WDM system, the auto-correlation function of an XPM process of multiple interfering channels is obtained by summing the auto-correlation functions of each interfering channel.

The neural network is trained to learn a continuous nonlinear mapping from the input space of physical layer parameters PTx, Ω, LSp, z′ to the output space of 3 coefficients c1, c2, c3 determining the actual auto-correlation function. The dataset is normalized and split into training and testing sets. While the training data is used to learn the nonlinear mapping, the test data is used to evaluate the robustness of the fit. The neural network is trained with the mean squared error (MSE) loss function given in (3.5b).
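A minimal sketch of this data preparation and the MSE loss, with hypothetical random stand-ins for the parameter inputs and coefficient targets:

```python
import numpy as np

# Hypothetical stand-ins for (P_Tx, Omega, L_Sp, z') and targets c1..c3
rng = np.random.default_rng(2)
inputs = rng.uniform(size=(3000, 4))      # physical layer parameters
targets = rng.normal(size=(3000, 3))      # PCA coefficients

# Normalize inputs to zero mean and unit variance
inputs = (inputs - inputs.mean(axis=0)) / inputs.std(axis=0)

# 90/10 train/test split after a random permutation
idx = rng.permutation(len(inputs))
split = int(0.9 * len(inputs))
train_idx, test_idx = idx[:split], idx[split:]

def mse(pred, true):
    """Mean squared error loss, cf. (3.5b)."""
    return np.mean((pred - true) ** 2)

# Example: loss of a trivial all-zero predictor on the test set
loss = mse(np.zeros_like(targets[test_idx]), targets[test_idx])
```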

4.4 Results and Discussion

The PCA extracts the main features of the auto-correlation functions from the dataset as principal components, see Fig. 4.2 (a). The first principal component p1[k] is the most important. It depicts a quickly decaying, always positive auto-correlation function with a short tail. This feature models the power of the XPM process. The second and third principal components, p2[k] and p3[k], model the long tail caused by chromatic dispersion. Further, the reconstruction of arbitrary auto-correlation functions also enables the analysis of the peak at R[0] and the half width at half maximum (HWHM) of the auto-correlation function.

Every column in Fig. 4.3 shows how the contribution of the principal component coefficients changes (top) and how the peak R[0] and HWHM of the auto-correlation function change (bottom) as the parameters are varied continuously. For the prediction in Fig. 4.3, when sweeping one of the four parameters, the others are set to the center of their respective range.

Fig. 4.3 (a,e) show the dependence of the interference on the channel launch power. Increased power levels yield an increased absolute value of the contribution of all coefficients. Since the first principal component models the power in the interference, it is affected more, and the peak R[0] of the auto-correlation function increases. Fig. 4.3 (b,f) depict the dependency on the channel spacing. Interfering channels with larger channel spacing from the COI have less impact. The evolution of the coefficients leads to a longer tail, as shown by the HWHM of the auto-correlation function, due to the longer walk-off of an interfering channel with larger channel spacing. Fig. 4.3 (c,g) show that the span


[Plots: (a-d) coefficients c1, c2, c3 (scale 10^-4) versus IC launch power P [dBm], channel spacing [GHz], span length [km] and propagated distance [km]; (e-h) ACF peak R[0] and HWHM [symbol delay] versus the same parameters, for model and prediction.]

Figure 4.3: (a-d) Contribution of the respective principal component to the overall auto-correlation function, and (e-h) the auto-correlation function peak R[0] and the HWHM for model (squares) and prediction (lines). Descriptions are in the respective first plot.


length has minor effects on the auto-correlation function, in particular since all nonlinear interactions occur at high power levels within the beginning of the span. Fig. 4.3 (d,h) show that without chromatic dispersion there is hardly any memory, and that with increasing chromatic dispersion the three coefficients and the HWHM of the auto-correlation function converge.

4.5 Summary

A framework is proposed to learn properties of an optical fiber channel model which are computationally expensive to estimate. Even with a sparse dataset, the PCA makes it possible to extract the most important features of the auto-correlation function of an interference effect. Thereafter, the neural network is able to predict the learned properties. Together they become a tool to analyze the effects on a system when physical layer parameters change.


CHAPTER 5

Neural Network Receiver for NFDM Systems

This chapter includes results presented in journal and conference contributions [C2, J2].

5.1 Introduction

This chapter takes a data science approach to receiving high-order solitons, derived from the nonlinear Fourier transform (NFT). The transmission concept is known as nonlinear frequency-division multiplexing (NFDM) and was introduced in Section 2.6. As discussed there, under a lossless channel assumption, the information carried by the solitons is perfectly recovered, since a linear transformation in the NFT domain compensates for a fixed transform of the constellations. Taking fiber losses and amplification noise into account, the received solitons are distorted with unknown effect on the constellations [105–111], and the NFT no longer provides exact solutions to the nonlinear Schrödinger equation (NLSE) and Manakov equations. After collecting data by transmitting many solitons through the channel, a neural network is able to learn decision regions in the high-dimensional vector space of the solitons, similar to the classification example in Section 3.4. For comparison, we also apply a minimum distance receiver in the time domain.

The study of this chapter makes two points. First, the statistics of the accumulated distortions of the propagated solitons are not additive Gaussian. Second, the NFT-based receiver is outperformed by the neural network receiver.


[Block diagram: transmitter (PRBS generation, NFT symbol mapping, INFT, I/Q MZM), transmission link (NSp spans of 41.7 km with EDFAs), and receiver (coherent frontend, resampling) followed by three parallel detection paths: linear transform and NFT receiver on the nonlinear spectra, minimum distance receiver, and neural network receiver, each with decision and BER count.]

Figure 5.1: Simulation setup.

Section 5.2 shows the simulation setup. Section 5.3 introduces the three compared receivers. Section 5.4.1 compares their performances for the additive white Gaussian noise (AWGN) channel. Section 5.4.2 looks at the soliton evolution in time and frequency domain during propagation. Section 5.4.3 compares the receiver performances for the fiber optic channel. A study of the neural network hyperparameters is given in Section 5.4.4. Section 5.5 summarizes the findings.

5.2 Simulation Setup

The simulation setup is depicted in Fig. 5.1. Single and dual polarization NFDM systems are simulated. The transmitter generates bits through a pseudorandom binary sequence (PRBS) generator with sequence length 2^11 − 1 = 2047, converts bits to symbols and applies the inverse NFT algorithm to map symbols to soliton pulses. Both single and dual polarization systems use 2 eigenvalues (λ1 = 0.3j and λ2 = 0.6j) with quadrature phase-shift keying (QPSK) constellations as b(λi) scattering coefficients with unitary radius, resulting in N=16 and N=256 possible waveforms for single and dual


polarization, respectively, as described in Section 2.6. A rotational offset of π/4 is applied to the constellations associated with the eigenvalue with larger imaginary part, as depicted in Fig. 2.6. The symbol rate is 1/Ts = 1 GBd and an NFT intrinsic normalization parameter is given by T0 = 30 ps [97]. The Mach-Zehnder modulator (MZM) is driven in the pseudo-linear region by the electrical signal. The laser is considered ideal, such that the effect of phase noise is neglected. The NFDM-based transmitter generates a time-domain waveform of truncated and thus non-overlapping soliton pulses,

A(t, z=0) = Σ_k u_k(t − kTs, z=0),    (5.1)

where each u_k(t, z) is one of the 256 soliton pulses and Ts is the symbol duration of one single pulse.
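Assembling the transmit waveform of (5.1) can be sketched as below; the pulse waveforms are random placeholders for the actual inverse-NFT generated solitons:

```python
import numpy as np

# Sketch of Eq. (5.1): truncated, non-overlapping soliton pulses placed in
# consecutive symbol slots of R samples each.
rng = np.random.default_rng(3)
R = 128                               # samples per symbol slot
n_sym = 10
# Hypothetical stand-ins for the 256 pre-computed soliton waveforms
pulses = rng.normal(size=(256, R)) + 1j * rng.normal(size=(256, R))
tx_indices = rng.integers(0, 256, size=n_sym)

# A(t, z=0) = sum_k u_k(t - k*Ts): one pulse per slot, concatenated in time
waveform = np.concatenate([pulses[i] for i in tx_indices])
```

Because the pulses are truncated to their slots, the sum in (5.1) reduces to a simple concatenation.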

Two studies are conducted. First, the waveform is received after transmission over the AWGN channel. Second, the waveform is propagated through the fiber optic channel using the split-step Fourier method (SSFM) with chromatic dispersion parameter 17.5 ps/(nm km), nonlinear coefficient 1.25 1/(W km), fiber loss 0.195 dB/km and increasing transmission distance from 0 to 5004 km. The link is divided into NSp spans of 41.7 km with erbium-doped fiber amplifiers (EDFAs) compensating for the fiber loss. After coherent detection the waveform is sliced into single pulses at R=128 samples and thereafter independently detected by three receiver schemes.

5.3 Receiver Schemes

5.3.1 NFT Receiver

The first receiver uses the typical NFDM receiver scheme to recover the data. After transmission and coherent detection, the received pulses are transformed back into the nonlinear spectrum with the NFT. As explained in Section 2.6, a linear shift in the NFT-domain compensates for the channel and recovers the scattering coefficients, as shown in Fig. 5.1. Besides mapping the data carrying solitons into a domain where the effects of the fiber channel are effectively compensated for, the NFT also performs a dimension reduction: from the oversampled soliton to the discrete eigenvalues and their scattering coefficients. Fig. 5.2 shows a dual-polarization NFDM signal after 1251 km transmission in the NFT-domain.


[Scatter plots: discrete eigenvalues in the complex plane, and the scattering coefficient constellations (real vs. imaginary part of b1 and b2 at both eigenvalues) for the X- and Y-polarization.]

Figure 5.2: Constellation of a dual-polarization NFDM signal in the NFT-domain after 1251 km transmission.

5.3.2 Minimum Distance Receiver

After transmitting over a distance L, the minimum distance receiver is trained by averaging over multiple received instances of each of the N individual solitons. These averages are used as references u_ref^(n)(t, L), n=1, ..., N, to detect future received solitons. Hence, for every received soliton pulse u_k(t, L) the Euclidean distance to all references is calculated and the reference pulse with minimum distance is chosen for decoding:

n_opt[k] = argmin_n ∫_{−∞}^{∞} |u_k(t, L) − u_ref^(n)(t, L)|² dt,    (5.2)

where the integral becomes a summation for sampled signals. The minimum distance receiver is trained with 100,000 training symbols to extract the references: for dual polarization ∼391 symbols per reference, and for single polarization ∼6250 per reference.
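A sketch of this receiver for sampled signals, with random stand-ins for the N=16 single-polarization waveforms (the real references would be averages of received solitons after propagation):

```python
import numpy as np

# Minimum distance receiver of Eq. (5.2), sampled-signal version.
rng = np.random.default_rng(4)
N, R = 16, 128                          # single polarization: 16 waveforms
clean = rng.normal(size=(N, R))         # hypothetical noiseless pulses

# "Training": average 100 noisy received instances of each pulse
refs = np.stack([(clean[n] + rng.normal(scale=0.01, size=(100, R))).mean(axis=0)
                 for n in range(N)])

def detect(u_rx):
    """Return argmin_n sum_t |u_rx[t] - ref_n[t]|^2."""
    return int(np.argmin(np.sum(np.abs(u_rx - refs) ** 2, axis=1)))

decision = detect(clean[7] + rng.normal(scale=0.01, size=R))
```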

Page 90: Machine Learning Methods in Coherent Optical Communication Systems · Machine Learning Methods in Coherent Optical Communication Systems Jones, Rasmus Thomas Publication date: 2019

5.4 Results and Discussion 65

5.3.3 Neural Network Receiver

The minimum distance receiver obtains N averaged references after training. It does not take the conditional probability distributions of the channel output into account, whereas the neural network receiver estimates these during its training process.

With a sampling rate of R samples per symbol, all received pulses lie in an Rp-dimensional complex space, where p is the number of polarizations. Thus, the neural network learns the mapping from the Rp-dimensional complex space of pulses to a decision on one of the N possible transmitted messages. The Rp-dimensional complex input to the neural network is separated into real and imaginary parts and the decisions are transformed to one-hot encoded vectors of length N, see Appendix A. Hence, the neural network has 2Rp input nodes and N output nodes. The neural network is trained to process one soliton at a time, which means memory effects between two neighboring solitons cannot be taken into account. This also means that the neural network cannot learn the pattern of the used PRBS [60].
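This input/output encoding can be sketched as follows for the dual polarization values from the text (R=128, p=2, N=256); the pulse itself is a random placeholder:

```python
import numpy as np

# One received soliton pulse: p polarizations, R complex samples each
rng = np.random.default_rng(5)
R, p, N = 128, 2, 256
pulse = rng.normal(size=(p, R)) + 1j * rng.normal(size=(p, R))

# 2Rp real-valued input features: real parts followed by imaginary parts
features = np.concatenate([pulse.real.ravel(), pulse.imag.ravel()])

# One-hot encoded target of length N for a transmitted message index
label = np.eye(N)[42]
```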

The neural network receiver has one hidden layer with 32 and 128 hidden nodes for single and dual polarization, respectively. All neural networks are trained with sigmoid activation functions, a softmax output layer, the cross-entropy loss function and the Adam optimization algorithm, see Chapter 3. The dataset size is 100,000 symbols for each distance, where 90% are used for training and 10% for testing. The training is stopped when the test performance ceases to improve (early stopping). An independent dataset of 100,000 symbols for each distance is used for validation and all performance estimations in this chapter. The dataset entries are randomly permuted before training and testing.

For further analysis of the neural network hyperparameters, the amount of training data, the number of hidden nodes and the number of samples per symbol are swept.

5.4 Results and Discussion

5.4.1 Receiver Comparison over the AWGN Channel

As described in Section 3.4, the minimum distance receiver is optimal for the AWGN channel, with the optimal decision boundaries, which the neural network receiver approximates. In Fig. 5.3, the performance of all three receivers is shown for increasing optical signal-to-noise ratio (OSNR) for the AWGN


[Plot: BER versus OSNR [dB] (-5 to 15 dB) for the NFT, minimum distance and neural network receivers.]

Figure 5.3: BER with respect to the OSNR for the NFT, minimum distance and neural network receivers over the AWGN channel.

channel. The neural network receiver learns close to optimal decision regions. This suggests that over the fiber optic channel the neural network will also learn close to optimal performance, given enough training data and a large enough neural network size. The neural network hyperparameters are optimized for an OSNR of 0 dB, which causes an error floor for SNR > 5 dB. The NFT receiver is very error prone towards an AWGN source.

5.4.2 Temporal and Spectral Evolution

For correct detection, the solitons must stay temporally confined within their time slot Ts during transmission. The power envelope of the NFDM generated solitons evolves during transmission. This evolution is demonstrated in Fig. 5.4 (a): one instance of the possible solitons is propagated under lossless and noiseless (ideal) conditions and with losses and noise (nonideal). It is observable that under ideal conditions (top) the soliton evolves but stays in its time slot. The soliton distorts under non-ideal conditions (bottom), and is in the process of splitting and being removed from its time slot. Further, the PSD of NFDM solitons also evolves during transmission. A whole train of soliton pulses is propagated under non-ideal conditions and the evolution of the spectrum of the complex envelope is shown in Fig. 5.4 (b). It is observable that the bandwidth changes with the transmission distance and thus the receiver sampling rate must be chosen accordingly.
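The 99%-power bandwidth shown dashed in Fig. 5.4 (b) can be estimated from a PSD as sketched below (the Gaussian-shaped PSD is a hypothetical stand-in for the simulated spectrum):

```python
import numpy as np

# Estimate the bandwidth containing 99% of the signal power from a PSD.
rng = np.random.default_rng(6)
f = np.linspace(-5e9, 5e9, 1001)            # frequency grid [Hz]
psd = np.exp(-(f / 1e9) ** 2)               # hypothetical PSD samples

order = np.argsort(psd)[::-1]               # strongest bins first
cumulative = np.cumsum(psd[order]) / psd.sum()
n_bins = np.searchsorted(cumulative, 0.99) + 1
bw_99 = n_bins * (f[1] - f[0])              # occupied bandwidth [Hz]
```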


[Panels: NFDM soliton evolution over time at 0 km, 2500 km and 5000 km, for the ideal channel (top) and the nonideal channel (bottom).]

Figure 5.4: (a) Evolution of one soliton pulse through an ideal (lossless and noiseless) and a non-ideal realistic channel. (b) The evolution of the PSD of a train of soliton pulses for a non-ideal realistic channel with respect to the transmission distance, and the bandwidth containing 99% of its power (dashed).

5.4.3 Receiver Comparison over the Fiber Optic Channel

This section presents the simulation results of the three receiver schemes' performances. Fig. 5.5 shows BER performances for (a) single and (b) dual polarization transmissions. During transmission the solitons are distorted by attenuation and noise. This causes the solitons to gradually split and move out of their time slot, as shown in Fig. 5.4 (a). The NFT-based detection assumes undistorted pulses and that the received solitons have stayed within their time slot throughout transmission. These wrong assumptions lead to a performance degradation, as shown in Fig. 5.5. With the chosen setup, the


[Plots: BER versus distance [km] (0-5000 km) for the NFT, minimum distance and neural network receivers; panels (a) and (b).]

Figure 5.5: BER with respect to the transmitted distance for the NFT, minimum distance and neural network receivers, (a) single and (b) dual polarization.

degradation starts after a transmission distance of 1000 km. For shorter distances, in particular around 800-900 km, the performance of the NFT receiver degrades but improves again at 1000 km. The observable oscillating behavior of the performance curves is difficult to explain and must be further examined.

The accumulated distortion of the fiber link cannot be modeled as a Gaussian noise source, since the neural network receiver outperforms the minimum distance receiver, which is optimal for a Gaussian noise source. By estimating the probability distribution of the distortions, the neural network receiver allows transmission of up to 3000 km with a BER below 10−3.


[Plots: BER versus distance [km] for (a) 4-32 hidden nodes, (b) 8-128 samples per symbol, (c) 10,000-100,000 training symbols; and BER at 2001.6 km and 3002.4 km versus (d) number of hidden units, (e) samples per symbol, (f) number of training symbols.]

Figure 5.6: Neural network receiver BER performance for (a,d) different numbers of hidden nodes, (b,e) numbers of samples per symbol and (c,f) sizes of the training dataset, for the single polarization case.

5.4.4 Neural Network Receiver Hyperparameters

This section studies three neural network hyperparameters important to the receiver's performance. Hyperparameter optimization is a multidimensional problem, solved via a grid search in the hyperparameter space [178]. A grid search for every neural network trained, at every distance, is infeasible. Yet, the hyperparameters are swept independently to indicate their impact on performance and complexity. The complexity of a neural network is typically given by the number of free parameters and hence by the number of nodes, whereas here the number of input nodes is determined by the number of samples per symbol, and the number of output nodes by N. More free parameters empower a neural network to learn more complex patterns from data. Regularization techniques and increasing the training set size improve the performance. To find the best performing but least complex neural network, three hyperparameters are studied: the number of hidden nodes, the number of input nodes and the number of training samples.

A neural network with infinitely many hidden nodes can model any function [145], such that by increasing the number of hidden nodes the performance will reach a limit where no more information can be extracted from the data. This limit also determines a number of hidden nodes for which the performance has converged, and a neural network with the lowest complexity is obtained.

The sampling rate is in direct relationship to the number of input nodes of the neural network. The number of input nodes must be as low as possible for less complexity. At the same time the sampling rate should not conflict with the sampling theorem, otherwise information is lost. In contrast to typical wavelength-division multiplexing (WDM) systems, the bandwidth of the NFDM signal has no sharp bandlimit, Fig. 5.4, and reducing the sampling rate is a trade-off between providing more information to the neural network and the complexity of its input layer.

The neural network attempts to learn the probability distribution at the channel output from the received solitons; with more data, more empirical evidence is available to estimate it accurately.

For single polarization the performance for different hyperparameters is studied. Fig. 5.6 shows the performance for the number of hidden nodes, the sampling rate and the training data size. Additionally, at two transmission lengths, 2001.6 km and 3002.4 km (48 and 72 spans), the hyperparameters are optimized. Fig. 5.6 (a) and (d) show that 16 hidden nodes are sufficient to model the data with a neural network; with 8 hidden nodes the performance is degraded, and no noticeable performance improvement is achieved with 32 or more hidden nodes. Depending on the transmission distance, sampling rates of 16 to 32 samples per symbol are sufficient to capture the information provided by the signal, Fig. 5.6 (b) and (e). Fig. 5.6 (c) and (f) show that more than 50,000-75,000 data samples for training yield negligible performance improvement.

5.5 Summary

This study analyzed different reception methods for NFDM systems encoding information in the discrete spectrum. The neural network outperforms the standard NFT-based receiver and a minimum distance metric receiver. We show that the statistics of the accumulated distortions of the propagated solitons are not additive Gaussian, since the neural network receiver


outperforms the minimum distance receiver. Both alternative receivers outperform the NFT-based receiver.


CHAPTER 6

Geometric Constellation Shaping via End-to-end Learning

This chapter includes results presented in a submitted journal contribution [S1] and a conference contribution [C1].

6.1 Introduction

The shaping gain in the additive white Gaussian noise (AWGN) channel is known to be 1.53 dB [35], whereas the shaping gain for the fiber optic channel is unknown. However, analytical fiber channel models give the insight that the nonlinear impairment occurring in the fiber channel is dependent on statistical properties of the constellation in use [27]. In particular, large high-order moments of the constellation lead to increased nonlinear interference (NLI). While the amplification noise in the fiber channel is modeled as an AWGN source, shaping gain is achieved by a Gaussian-like shaped constellation. Such constellations hold relatively large high-order moments. A constellation jointly robust to the constellation dependent nonlinear effects and amplification noise must weigh the reduction of high-order moments against a Gaussian-like shape.
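As an illustration of the statistics involved (using standard 16-QAM as an example, not a constellation from this chapter), the normalized high-order moments of a unit-power constellation can be computed as:

```python
import numpy as np

# 16-QAM constellation, normalized to unit average power
levels = np.array([-3, -1, 1, 3])
const = np.array([a + 1j * b for a in levels for b in levels])
const = const / np.sqrt(np.mean(np.abs(const) ** 2))

mu4 = np.mean(np.abs(const) ** 4)    # fourth-order moment E[|x|^4]
mu6 = np.mean(np.abs(const) ** 6)    # sixth-order moment E[|x|^6]
```

For unit-power 16-QAM these evaluate to mu4 = 1.32 and mu6 = 1.96; Gaussian-like shaped constellations have larger values, which is exactly the trade-off discussed above.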

This chapter combines an unsupervised learning algorithm with a fiber channel model to obtain a constellation including this implicit trade-off. By embedding a fiber channel model within an autoencoder (AE), the encoder is bound to learn a constellation jointly robust to both impairments, as depicted in Fig. 6.1. When the AE is trained by wrapping the Gaussian noise (GN)-model [23, 26], the learned constellation is optimized for an AWGN channel, with an effective signal-to-noise ratio (SNR) determined by the launch power,


[Diagram: one-hot vectors e1-e4 enter the encoder, which outputs constellation symbols s1-s4 in the I/Q plane; after the channel, the decoder maps received values r1-r4 to probabilities p(si = 1 | ·); encoder and decoder are optimized jointly.]

Figure 6.1: Two neural networks, encoder and decoder, wrap a channel model and are optimized end-to-end. Given multiple instances of four different one-hot encoded vectors, the encoder learns four constellation symbols which are robust to the channel impairments. Simultaneously, the decoder learns to classify the received symbols such that the initial vectors are reconstructed.

and nonlinear effects are not mitigated. Trained with the nonlinear interference noise (NLIN)-model [27, 28], the learned constellation mitigates nonlinear effects by optimizing its high-order moments.

This chapter compares the performance, in terms of mutual information (MI) and estimated received effective SNR, of the learned constellations to standard quadrature amplitude modulation (QAM) and iterative polar modulation (IPM)-based geometrically shaped constellations. IPM-based geometrically shaped constellations were introduced in [64] and thereafter applied in optical communications [117, 118]. The iterative optimization method optimizes for a geometrically shaped constellation of multiple rings to achieve shaping gain. The IPM-based geometrically shaped constellations optimized under the AWGN assumption provided in [64, 117, 118] are used in this work. Results of both simulations and experimental demonstrations are presented. The experimental validation also revealed the challenge of matching the chosen channel model of the training process to the experimental setup.

Section 6.2 describes the AE architecture, its training, and joint optimization of system parameters. Section 6.3 describes the simulation results evaluating the learned constellations with the NLIN-model and the split-step Fourier method (SSFM). The experimental demonstration and results are presented


[Diagram: the encoder neural network fθf(·), R2N → CN with normalization, maps s ∈ S = {ei | i = 1..M} to x ∈ CN; the channel model cGN/NLIN(·) yields y ∈ CN; the decoder neural network gθg(·), CN → R2N with softmax, outputs r ∈ {p ∈ R+M | Σi pi = 1}.]

Figure 6.2: End-to-end AE model.

in Section 6.4. Section 6.5 describes joint optimization of system parameters and constellation shape. In Section 6.6 the simulation and experimental results are discussed, and they are concluded in Section 6.7. The optical fiber models together with the AE model are available online as Python/TensorFlow programs [192].

6.2 End-to-end Learning

6.2.1 Autoencoder Model Architecture

An AE model, with neural networks as encoder and decoder and an embedded channel model, is depicted in Fig. 6.2 and mathematically described as follows:


x = fθf(s),          (6.1a)
y = cGN/NLIN(x),     (6.1b)
r = gθg(y),          (6.1c)

where fθf(·) is the encoder, gθg(·) the decoder and cGN/NLIN(·) are the channel models, described in Section 2.3.5. The goal of the AE is to reproduce the input s at the output r, through the latent variable x (and its impaired version y). The trainable variables of the encoder and decoder neural networks are represented by θf and θg, respectively. The parameter vector θ = {θf, θg} includes all trainable neural network weights. While the encoder optimizes the locations of non-equidistant constellation points, the decoder learns the conditional probability distribution of the symbols at the channel output.

The AE must be constructed such that the learned constellation has the desired dimension and order. The constellation dimension is equal to the dimension of the latent space. The constellation order is equal to the input and output space of the AE. A constellation of order M is trained with one-hot encoded vectors, s ∈ S = {ei | i = 1, ..., M}. The decoder is concluded with a softmax function, as described in Appendix A, such that the decoder yields a probability vector, r ∈ {p ∈ R+M | Σi=1..M pi = 1}.

The neural networks operate on real numbers, so a constellation of N complex dimensions is learned by choosing the output of the encoder network and the input of the decoder network to be of 2N real dimensions. An average power constraint on the learned constellation is enforced by a normalization before the channel.
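The forward pass of Eq. (6.1) can be sketched in a few lines. This is a simplified stand-in, not the published code [192]: the encoder is reduced to a trainable lookup of M points, the channel to plain AWGN instead of the GN/NLIN-models, and the decoder to the Gaussian posterior that a trained decoder network approximates; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, dim = 64, 2                    # constellation order, 2 real dims = 1 complex dim

# Encoder f (Eq. 6.1a), reduced to a trainable lookup of M points.
points = rng.normal(size=(M, dim))
points /= np.sqrt(np.mean(np.sum(points**2, axis=1)))   # average power constraint

def encode(s):
    """x = f(s) for one-hot rows s of shape (batch, M)."""
    return s @ points

def channel(x, sigma=0.1):
    """AWGN stand-in for c_GN/NLIN (Eq. 6.1b)."""
    return x + sigma * rng.normal(size=x.shape)

def decode(y, sigma=0.1):
    """r = g(y) (Eq. 6.1c): softmax over Gaussian likelihoods of each
    point, i.e. the posterior a trained decoder network approximates."""
    d2 = ((y[:, None, :] - points[None, :, :])**2).sum(axis=-1)
    z = np.exp(-d2 / (2 * sigma**2))
    return z / z.sum(axis=1, keepdims=True)   # rows are probability vectors

s = np.eye(M)[rng.integers(0, M, size=1000)]  # one-hot encoded symbols
r = decode(channel(encode(s)))                # reproduce s at the output r
```

In the thesis, f and g are neural networks trained jointly through the differentiable channel; the sketch is only meant to make the one-hot input, power normalization and softmax output interfaces of Eq. (6.1) concrete.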

6.2.2 Autoencoder Model Training

The AE model is trained like a single neural network, as treated in Chapter 3. The end-to-end optimization minimizes the cross-entropy loss and uses the momentum-based Adam optimizer [187].

Fig. 6.3 shows the convergence of the loss for different training batch sizes. A small training batch size leads to faster convergence but worse final performance, and a large training batch size leads to slower convergence but superior final performance. A good trade-off between convergence speed, computation time and performance is obtained when starting the training with a smaller


Figure 6.3: Convergence of cross-entropy loss for different training batch sizes. The green/dashed curve is trained with a training batch size of 8M until iteration 100, from where it is increased to 2048M. This example learns a constellation with the NLIN-model for M=64.

Table 6.1: Hyperparameters of both encoder and decoder neural networks.

    Hyperparameter               Value
    # Layers                     1-2
    # Hidden nodes per layer     16-32
    Learning rate                0.001
    Activation function          ReLU [139]
    Optimization method          Adam [187]
    Training batch size          Adaptive (see Section 6.2.2)

batch size and increasing it after initial convergence. This is shown in Fig. 6.3, where the green/dashed plot depicts a training run with the training batch size being increased after 100 iterations. A larger training batch size provides more empirical evidence of the channel model characteristics; with better statistics, the AE can optimize more accurately.
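The adaptive batch-size strategy amounts to a simple schedule. A minimal sketch (the function name is ours; the sizes and switch point follow the green/dashed run in Fig. 6.3):

```python
def training_batch_size(iteration, M, switch_at=100, small=8, large=2048):
    """Adaptive training batch size in symbols: start small for fast
    initial convergence, then enlarge for better statistics of the
    embedded channel model."""
    return (small if iteration < switch_at else large) * M

# For M = 64: 8M = 512 symbols per batch up to iteration 99,
# then 2048M = 131072 symbols per batch.
```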

The size of the neural networks, i.e. the number of layers and hidden nodes, depends on the order of the constellation. Other neural network hyperparameters are given in Table 6.1. Besides the training batch size, the hyperparameters have only a marginal effect on the performance of the training process.


6.2.3 Mutual Information Estimation

From a receiver standpoint, both channel models appear Gaussian. With both channel models, GN-model or NLIN-model, a receiver under a memoryless Gaussian auxiliary channel assumption is the maximum likelihood (ML) receiver [35, 92]. As discussed in Section 3.4, the decoder neural network attempts to mimic the ML receiver, but only reaches close-to-optimal performance. The MI could be estimated from the output of the decoder, as described in Appendix A, but the valid auxiliary channel assumption allows the MI to be estimated as described in [35, 92]. The latter yields superior performance, since the decoder only learns close-to-optimal detection, as discussed in Section 3.4. This allows the trained neural networks to be discarded for transmission and only the learned constellations to be used. At the transmitting end, the encoder neural network is replaced by a look-up table. The training process is still carried out with both neural networks due to their differentiability.

With a dispersion-free channel model [193] or intensity modulation/direct detection (IM/DD) transmission [194], where a Gaussian auxiliary channel assumption does not hold, the trained decoder neural network or another receiver is required for detection.
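Under the Gaussian auxiliary channel assumption, the MI estimate of [35, 92] can be sketched as follows. This is an illustrative implementation (complex baseband, equiprobable symbols; the function name is ours), not code from [192]:

```python
import numpy as np

def mi_gaussian_aux(x, y, const, sigma2):
    """MI estimate (bit/symbol) under a memoryless Gaussian auxiliary
    channel q(y|x) ~ CN(x, sigma2), equiprobable symbols.
    x: transmitted points, y: received samples, const: constellation."""
    # log q(y|x_j) for all constellation points, up to a common constant.
    d2 = np.abs(y[:, None] - const[None, :])**2
    logq = -d2 / sigma2
    logq_tx = -np.abs(y - x)**2 / sigma2      # log q(y|transmitted x)
    # log sum_j q(y|x_j), computed stably (log-sum-exp trick).
    m = logq.max(axis=1)
    log_sum = m + np.log(np.exp(logq - m[:, None]).sum(axis=1))
    # MI ~= log2 M + E[ log2( q(y|x) / sum_j q(y|x_j) ) ]
    return np.log2(len(const)) + np.mean(logq_tx - log_sum) / np.log(2)
```

For a noiseless QPSK input the estimate approaches log2 M = 2 bit/symbol, and it decreases as the auxiliary-channel noise grows, as expected of an achievable rate.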

6.2.4 Learned Constellations

Constellations obtained by training an AE with an embedded NLIN-model or GN-model are referred to as AE-NLIN or AE-GN constellations, respectively. Fig. 6.4 shows a set of learned constellations at different launch power levels. The top row shows AE-GN constellations and the bottom row AE-NLIN constellations. At low power levels (left), the available SNR is too low and the AE finds a constellation where the inner constellation points are randomly positioned. At the optimal per channel launch power (center), both constellations are similar and also yield similar performance in MI. At high power levels (right), the NLI is the primary impairment. All points of the AE-NLIN constellation form a ring of uniform intensity, since a ring-like constellation has minimized moments and thereby minimizes the NLI. The AE-GN constellation, trained under an AWGN channel assumption independent of modulation dependent effects, lacks this property.


Figure 6.4: Constellations learned with (top) the GN-model and (bottom) the NLIN-model for M=64 and 2000 km (20 spans) transmission length at per channel launch power (left to right) −5.5 dBm, 0.0 dBm, 4.5 dBm and 9.5 dBm. (MI per panel, left to right: GN-model 7.05, 9.31, 6.71, 1.95; NLIN-model 7.05, 9.32, 6.88, 2.47.)

6.3 Simulation

6.3.1 Setup

For the simulation study, a wavelength-division multiplexing (WDM) communication system was simulated with the SSFM. The transmission parameters are given in Table 6.2. The transmitter models digital pulse-shaping and ideal optical in-phase (I)/quadrature (Q) modulation. All channels are assumed to have the same optical launch power and draw from the same constellation. The channel included multiple spans with EDFAs. The receiver modeled an optical filter for the center channel, digital chromatic dispersion compensation and matched filtering. The NLIN-model was also used for performance evaluations. The AE-learned constellations, QAM constellations and IPM-based geometrically shaped constellations were used to simulate the system described above.


    Simulation Parameter         Value
    # Symbols                    131072 = 2^17
    Symbol rate                  32 GBd
    Oversampling                 32
    Channel spacing              50 GHz
    # Channels                   5
    # Polarisations              2
    Pulse-shaping                root-raised-cosine
    Roll-off factor              SSFM: 0.05; NLIN-model: 0 (Nyquist)
    Span length                  100 km
    Nonlinear coefficient        1.3 (W km)^-1
    Dispersion parameter         16.48 ps/(nm km)
    Attenuation                  0.2 dB/km
    EDFA noise figure            5 dB
    Stepsize¹                    0.1 km

Table 6.2: Simulation parameters of SSFM simulations and NLIN-model evaluations.

6.3.2 Results

The performance of the constellations was compared in terms of the measured effective SNR and the MI, which represents the maximum achievable information rate (AIR) under an auxiliary channel assumption [35]. The simulation results for a 2000 km transmission (20 spans) are shown in Fig. 6.5. The learned constellations outperform the QAM constellations in MI, and the IPM-based constellations in effective SNR. The QAM constellations have the largest effective SNR due to their low high-order moments, but do not provide shaping gain.

The optimal launch power for the NLIN-model learned constellations is slightly shifted towards higher power levels compared to GN-model learned constellations and IPM-based constellations. Fig. 6.6 (a) shows that for higher power levels the gain with respect to QAM constellations is larger for the NLIN-model learned constellations, and in Fig. 6.6 (b), the NLIN-model learned constellations have a negative slope in their moments. This shows that when embedding

¹ In a fruitful discussion with Associate Professor Andrea Carena from the Polytechnic University of Turin, it was called to attention that a constant stepsize, as chosen here, can lead to artificial effects within the SSFM. This can be avoided by randomizing the stepsize at every step.


Figure 6.5: Performance in (top) effective SNR and (bottom) MI with respect to launch power after 2000 km transmission (20 spans) for (left) M=64 and (right) M=256. Plots denoted as AE-GN and AE-NLIN indicate that the constellation was learned with an AE using the GN-model and NLIN-model, respectively. Lines depict performance evaluations using the NLIN-model and markers using the SSFM.


Figure 6.6: (a) Gain compared to the standard M-QAM constellations with respect to the launch power after 2000 km transmission (20 spans). (b) 4th and 6th order moments (µ4 and µ6) of the considered constellations with M=256 with respect to the per channel launch power and 2000 km transmission distance. (c) Gain compared to the standard M-QAM constellations at the respective optimal launch power with respect to the number of spans, estimated with the NLIN-model.


Figure 6.7: Experimental setup. Abbr.: (IL) Interleaver, (PC) Polarization Controller, (AWG) Arbitrary Waveform Generator, (DP-IQ) Dual-Polarization IQ Modulator, (EDFA) Erbium-Doped Fiber Amplifier, (VOA) Variable Optical Attenuator, (MCF) Multicore Fiber (used to emulate 3 identical single-mode fiber spans), (BPF) Band-pass Filter, (LO) Local Oscillator.

the NLIN-model within the AE, the learned constellations have smaller high-order moments and smaller NLI. At the optimal launch power and transmission distances between 2500 km and 5500 km, the improvements are marginal, as shown in Fig. 6.6 (c). Yet, at shorter distances the learned constellations outperformed the IPM-based constellations.

For a transmission distance of 2000 km (20 spans) at the respective optimal launch power, the learned constellations obtained 0.02 bit/4D and 0.00 bit/4D higher MI than IPM-based constellations evaluated with the NLIN-model for M=64 and M=256, respectively. Evaluated with the SSFM, the learned constellations obtained 0.00 bit/4D and 0.00 bit/4D improved performance in MI compared to IPM-based constellations for M=64 and M=256, respectively.

6.4 Experimental Demonstration

6.4.1 Setup

Fig. 6.7 depicts the experimental setup. The 5-channel WDM system was centered at 1546.5 nm with a 50 GHz channel spacing, with the carrier lines extracted from a 25 GHz comb source using one port of a 25 GHz/50 GHz optical interleaver. Four arbitrary waveform generators (AWG) with a baud rate of 24.5 GBd were used to modulate two electrically decorrelated signals onto the odd and even channels, generated in a second 50 GHz/100 GHz optical interleaver (IL in Fig. 6.7). The fiber channel comprised 1 to 3 fiber spans with lumped amplification by EDFAs. In the absence of identical fiber spans, we chose to use 3 far-separated outer cores of a multicore fiber. The multicore fiber was 53.5 km long and the crosstalk between cores was measured to be negligible. The launch power into each span was adjusted by variable optical attenuators (VOA) placed after in-line EDFAs. By scanning the fiber launch power, the input power to post-span EDFAs also changes, which may slightly impact the noise performance. At the receiving end, the center channel was optically filtered and passed on to the heterodyne coherent receiver. The digital signal processing (DSP) was performed offline. It involved resampling, digital chromatic dispersion compensation, and data-aided algorithms for both the estimation of the frequency offset and for the filter taps of the least mean squares (LMS) dual-polarization equalization. The equalization integrated a blind phase search. The data-aided algorithms are further discussed in Section 6.6.3. The AE model was enhanced to include transmitter imperfections in terms of an additional AWGN source between the normalization and the channel. For the experiment, new constellations had to be learned by the enhanced AE model including transmitter imperfections; otherwise, the learned constellations are not suited for the experimental setup used. The back-to-back SNR for a 64-QAM constellation was measured to be 22.55 dB, whereas the back-to-back SNR of a 64-AE-NLIN constellation was 21.87 dB. The latter back-to-back SNR was used within the enhanced AE model to simulate the transmitter imperfections and learn all the AE constellations for the experiments. We note that it was challenging to reliably measure some experimental features, such as transmitter imperfections and varying EDFA noise performance, and to accurately represent them in the model. Hence, it is likely that some discrepancy between the experiment and the modeled link occurred, as further discussed in Section 6.6.2.

6.4.2 Results

The constellations were also evaluated experimentally using the measured effective SNR and the MI. Fig. 6.8 shows the performance of the learned, QAM and IPM-based constellations. In general, the learned constellations yielded improved performance compared to the IPM-based and QAM constellations. However, with more spans the discrepancy between the experiment and the modeled link increases. Over one span, Fig. 6.8 (left), the learned constellations outperformed QAM and IPM-based constellations. For M=64


Figure 6.8: Experimental results. (top) Effective SNR with respect to the launch power for M=64 and M=256 constellations. (mid) MI with respect to the launch power for M=64 constellations. (bottom) MI with respect to the launch power for M=256 constellations. (left), (center) and (right) depict the results of one, two and three span transmissions, respectively.


the IPM-based constellations performed worse. Over two spans and M=64, Fig. 6.8 (center), the IPM-based constellations yielded improved performance. Over two spans and for M=256, they outperformed the GN-model learned constellations and matched the performance of the NLIN-model learned constellations. This indicates that the NLIN-model learned constellations failed to mitigate nonlinear effects due to the model discrepancy. Similar findings to the transmissions over two spans were found over three spans, Fig. 6.8 (right).

Collectively, the learned constellations, whether learned with the NLIN-model or the GN-model, yielded improved performance compared to QAM and IPM-based constellations at their respective optimal launch power. For M=64, the improvement was 0.02 bit/4D, 0.04 bit/4D and 0.03 bit/4D in MI over 1, 2 and 3 span transmissions, respectively. For M=256, the improvement was 0.12 bit/4D, 0.04 bit/4D and 0.06 bit/4D in MI over 1, 2 and 3 span transmissions, respectively.

6.5 Joint Optimization of Constellation and System Parameters

The AE model is constructed and trained with the machine learning framework TensorFlow [13], which provides an application programming interface to compose computation graphs. Both the neural networks and the fiber channel model are implemented as such. TensorFlow includes powerful optimization methods for the training of neural networks. Yet, the training is not limited to the weights of the neural networks: other parameters of the computation graph can be trained jointly, which is achieved by extending θ, as long as the model stays differentiable w.r.t. θ.

In Section 6.3.2, the results show that the optimal per channel launch power of the M=256 NLIN-model learned constellation lies at 0.13 dBm. This optimum was obtained by training several AEs at different launch powers. Yet, the optimum can be estimated jointly with the training of a constellation shape by including the launch power among the trainable parameters of the AE model, θ = {θf, θg, Ptx}. The training process starts with an initial guess for all parameters. The neural network weights are initialized with Gaussian noise. The launch power converges towards the optimal value of 0.13 dBm, Fig. 6.9 (a), even when starting the training process with different initial values. Further, the performances in cross-entropy and MI of all training


Figure 6.9: (a) Convergence of the per channel launch power starting from different initial values, and the respective (b) cross-entropy loss and (c) MI, when jointly learning the constellation shape and launch power. The constellation is learned using the NLIN-model for M=256.

runs converge independently of the initial value chosen for Ptx, as shown in Fig. 6.9 (b) and (c).
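The effect of extending θ with Ptx can be illustrated on a toy GN-style SNR curve. This is a deliberately simplified stand-in, not the NLIN-model: the coefficients are hypothetical, the gradient is written out analytically, and in the actual AE the update comes from backpropagating through the channel model.

```python
import numpy as np

# Toy GN-style link: SNR(P) = P / (n_ase + eta * P^3), P in mW.
# n_ase (ASE noise) and eta (nonlinearity) are hypothetical coefficients.
n_ase, eta = 0.05, 0.01

def snr(p):
    return p / (n_ase + eta * p**3)

def dsnr_dp(p):
    # Analytic derivative of SNR(P); supplied by autodiff in TensorFlow
    # once Ptx is added to the trainable parameters.
    return (n_ase - 2 * eta * p**3) / (n_ase + eta * p**3)**2

p_opt = (n_ase / (2 * eta))**(1 / 3)   # stationary point of SNR(P)

final_powers = []
for p0 in (0.2, 1.0, 5.0):             # different initial launch powers
    p = p0
    for _ in range(500):
        p += 0.05 * dsnr_dp(p)         # gradient ascent on the SNR
    final_powers.append(p)             # every run ends near p_opt
```

The qualitative behavior mirrors Fig. 6.9 (a): regardless of the initial guess, the trainable launch power drifts to the same optimum.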

6.6 Discussion

6.6.1 Mitigation of Nonlinear Effects

The NLIN-model describes a relationship between the launch power, the high-order moments of the constellation and signal-degrading nonlinear effects. Hence, there exist geometrically shaped constellations that minimize nonlinear effects by minimizing their high-order moments. In contrast, constellations robust to Gaussian noise are Gaussian shaped, with large high-order moments. This means the minimization only improves the system performance if the relative contributions of the modulation dependent nonlinear effects are significant compared to those of the amplified spontaneous emission (ASE) noise. If the relative contributions of both noise sources to the overall signal degradation are similar, an optimal constellation must be optimized for a trade-off between the two. The AE model is able to capture these channel characteristics, as shown by the higher robustness of the NLIN-model learned constellations. At higher power levels, these learned constellations yield a gain in MI and SNR compared to IPM-based geometrically shaped constellations and constellations learned with the GN-model. Further, in Fig. 6.6 (b), the high-order moments of the more robust constellations decline with larger launch


power. This shows that nonlinear effects are mitigated. At larger transmission distances, Fig. 6.6 (c), modulation format dependent nonlinear effects become less dominant, and therefore the gains of the constellations learned with the NLIN-model are also reduced.
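The moment computation underlying Fig. 6.6 (b) can be sketched directly; the helper name and example constellations are ours:

```python
import numpy as np

def high_order_moments(const):
    """4th and 6th order moments mu4, mu6 of a complex constellation,
    after normalization to unit average power (E|x|^2 = 1)."""
    c = const / np.sqrt(np.mean(np.abs(const)**2))
    return np.mean(np.abs(c)**4), np.mean(np.abs(c)**6)

# A constant-modulus ring (like the high-power AE-NLIN solution in
# Fig. 6.4) attains the minimum mu4 = mu6 = 1; square QAM has larger
# moments and hence larger modulation dependent NLI.
ring16 = np.exp(2j * np.pi * np.arange(16) / 16)
qam16 = np.array([a + 1j * b for a in (-3, -1, 1, 3) for b in (-3, -1, 1, 3)])
```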

6.6.2 Model Discrepancy

The learned constellations match the performance of, or outperform, the IPM-based geometrically shaped constellations. Yet, the results do not distinctly show that the NLIN-model learned constellations mitigate nonlinear effects experimentally. The channel model parameters used for the AE model training must match those of the experimental setup. Unknown fiber parameters, imperfect components and device operating points make this challenging. In particular, transmitter imperfections due to digital-to-analog conversion are hard to capture in the model, since they can be specific to a particular constellation. For the experimental demonstration in Section 6.4.1, the SNR at the transmitter was measured for a QAM constellation and included as an AWGN source in the AE model. Including further transmitter/receiver components is desirable, but also adds more unknowns. The sensitivity of the presented method to such uncertainties must be studied further. Methods introducing reinforcement learning [173] or adversarial networks [195] can potentially avoid the modeling discrepancy and learn in direct interaction with the real system.

6.6.3 Digital Signal Processing Drawbacks

Standard signal recovery algorithms often exploit symmetries inherent in QAM constellations, which is a typical issue for non-standard QAM constellations. Constellation-independent pilot-aided DSP algorithms are typically required in such cases [196, 197]. In order to receive the learned constellations, the frequency offset estimation implements an exhaustive search based on temporal correlations. It correlates the received signal with many different frequency-shifted versions of the transmitted signal until the matching frequency offset is found. The adaptive LMS equalizer taps are estimated with full knowledge of the transmitted sequence. The blind phase search is integrated within the equalizer, but has no knowledge of the data and searches across all angles instead of the angles of one quadrant.
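The exhaustive correlation search for the frequency offset can be sketched as follows; the function name, sampling convention and candidate grid are illustrative, not taken from the experimental DSP code:

```python
import numpy as np

def estimate_freq_offset(rx, tx, fs, candidates):
    """Data-aided exhaustive search: correlate the received signal with
    frequency-shifted copies of the known transmitted sequence and return
    the candidate shift with maximum correlation magnitude."""
    n = np.arange(len(rx))
    def corr(f):
        return np.abs(np.vdot(tx * np.exp(2j * np.pi * f * n / fs), rx))
    return max(candidates, key=corr)
```

Because the transmitted sequence is known, the correlation peaks sharply at the true offset, independent of any constellation symmetry.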


Other constellation shaping methods for the fiber optic channel, which constrain the solution space to square constellations, preserve the required symmetries, such that standard DSP algorithms are applicable [133].

6.7 Summary

This chapter proposed a new method for geometric shaping in fiber optic communication systems, using an unsupervised machine learning algorithm. With a differentiable channel model including modulation dependent nonlinear effects, the learning algorithm yields a constellation mitigating these effects, with gains of up to 0.13 bit/4D in simulation and up to 0.12 bit/4D experimentally. The machine learning optimization method is independent of the embedded channel model and allows joint optimization of system parameters. The experimental results show improved performance but also highlight challenges regarding matching the model parameters to the real system and limitations of conventional DSP algorithms for the learned constellations.


CHAPTER 7

Conclusion

The increasing demand for global data traffic must be met with more flexible and scalable coherent optical communication systems. The limitations of such systems require better fiber channel models, complex optimization methods and consideration of entirely new approaches for transmission. Further, future optical communication systems will grow in complexity, in which autonomous algorithms can help to explore possible solutions for increasing transmission data rates and resource allocation. Machine learning promises to provide solutions to these future challenges. The methods presented in this work include a machine learning aided fiber channel model to increase its computational speed, a viable receiver for a novel transmission technique, and a learning algorithm optimizing a modulation format. They provide initial and explorative studies of machine learning methods for coherent optical communication systems which future research can build upon.

This thesis presented a review of machine learning methods and applied them in novel contexts to coherent optical communication systems, increasing their capabilities. In this chapter, the results of this work are summarized and an outlook on potential future research directions is presented.

7.1 Summary

The reviewed machine learning methods were applied to coherent optical communication systems as follows:

Prediction of Fiber Channel Properties

Models of the nonlinear optical fiber channel are important to provide simulation tools and to assess system performance. The most accurate models are only semi-analytic, leading to exhaustive computation if the impact of specific system parameters is of interest. In this case, real-time applications


can become infeasible. In Chapter 4, a machine learning algorithm is trained on a sparse dataset of the computationally expensive properties to predict and interpolate them. Going forward, the machine learning algorithm predicts the properties and the computationally heavy step is skipped. The new machine learning aided channel model is not analytic, but the reduced processing requirement means that real-time analysis becomes feasible.

Neural Network Receiver for NFDM Systems

The study in Chapter 5 analyses different reception methods for nonlinear Fourier transform (NFT) derived signals. Three receivers are compared: an NFT-based receiver with decision in the nonlinear spectra, a time-domain minimum distance metric receiver and a neural network receiver trained on historical data. The neural network outperforms the standard NFT-based receiver and the minimum distance metric receiver. Distortions caused by losses and noise during transmission are not compensated for by the NFT receiver. The neural network handles such impairments by learning the distortion characteristics from previous transmissions and is highly adaptable to the system configuration. The poor performance of the minimum distance metric receiver leads to the conclusion that the accumulated distortions throughout the link cannot be modeled as additive white Gaussian noise (AWGN). The obtained results present a performance benchmark for future NFT receiver algorithms.

Geometric Constellation Shaping via End-to-end Learning

In Chapter 6, a new geometric constellation shaping method for fiber optic communication systems was proposed. By embedding a fiber channel model within two neural networks, the fiber channel characteristics are captured, such that the learned constellation shape is jointly robust to amplification noise and signal dependent nonlinear effects. The method is independent of the channel model, as long as it is differentiable w.r.t. the optimized parameters. A simulation and an experimental study were conducted, and performance gains were reported, with up to 0.13 bit/4D in simulation and up to 0.12 bit/4D experimentally. Problems were identified, such as discrepancies between the fiber channel model and the experimental setup and challenges to the digital signal processing (DSP) subsystem.


7.2 Outlook

Future directions for the proposed methods are given as follows:

Prediction of Fiber Channel Properties

The presented framework learns a nonlinear mapping between physical layer parameters and computationally expensive model properties. With more data, the neural network will yield more accurate predictions. Future research may include a sensitivity analysis of the required density of data samples in each dimension of the input to the neural network. Further, since additional data samples are available through computation, one may determine where in the input space an extra training sample would yield the most performance gain after retraining the algorithm. Machine learning methods that allow the algorithm to query new data points are called active learning and would be an interesting avenue of future study [198, 199].

Neural Network Receiver for NFDM Systems

In Chapter 5 it was argued that, since the trained neural network learns a close-to-optimal decision rule for the AWGN channel, the learned decision rule for the nonlinear optical fiber channel must also be close to optimal. The poor performance of the NFT-based receiver must be studied further, in particular the impact of the NFT on the received signal, since it could cause information loss. An interesting area of further study would be to apply classification in the NFT-domain on the observed eigenvalues and scattering coefficients. If the performance of an NFT-domain neural network is worse than that of the presented time-domain neural network, the NFT transformation at the receiver must introduce information loss. If the performance of the two neural networks is similar, the NFT is a viable dimensionality reduction algorithm.

Geometric Constellation Shaping via End-to-end Learning

In Chapter 6, discrepancies between the fiber channel model and the experimental setup were reported. Methods independent of a differentiable channel model are an interesting area for future research. In particular, generative adversarial networks [195] or methods from the reinforcement learning community [173] are able to learn in interaction with a real system. Such methods



are a generalization of the ones presented in this thesis and would be beneficial to explore in future work.

The obtained constellation shapes are incompatible with standard DSP algorithms, which were developed for modulation formats with intrinsic symmetries. For practical commercial use of the learned constellation shapes, new DSP algorithms are required. These algorithms could be machine learning aided [200], or the learning algorithm must be forced to learn constellation shapes with intrinsic symmetries [133], making them compatible with standard DSP algorithms. Moreover, extending the optimization method to optimize for generalized mutual information (GMI) rather than mutual information (MI) will certainly yield different constellation shapes and is important for integration with binary forward error correction (FEC) [127].

In contrast to the applied channel models, the optical fiber channel is not memoryless due to dispersive effects. Combining temporal machine learning methods, such as recurrent neural networks [158], with models including memory [29, 190, 191, 201], might yield larger gains.

Nevertheless, the reported results show that machine learning is a viable method for optimization within complex optical communication systems. The study presents an initial exploration of the end-to-end learning method for optical coherent transmission systems.


APPENDIX A
Softmax Activation and Mutual Information Estimation

A neural network trained for classification is concluded with a softmax layer [158], ensuring that the output of the neural network is always a probability vector. The softmax layer is given by:

\mathrm{softmax}_i(x) = \frac{e^{x_i}}{\sum_{d=1}^{D} e^{x_d}}, \quad \text{for } i = 1, \dots, D \qquad (A.1)

where x_i are the output nodes of the neural network and D is the number of classes; the number of neural network output nodes equals the number of classes. For training, the targets of the data must be encoded as probability vectors that place all probability mass on the correct class. Such vectors are typically called one-hot encoded vectors, holding a single one at the position of the respective class and zeros elsewhere, i.e., e_i | i = 1, ..., D, where e_i are the unit vectors.
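A minimal numeric sketch of Eq. (A.1) and a one-hot target, with hypothetical logit values:

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability; the result sums to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

D = 4                        # number of classes
t = np.eye(D)[2]             # one-hot target e_3: [0., 0., 1., 0.]

logits = np.array([0.5, -1.0, 2.0, 0.1])   # hypothetical network outputs
p = softmax(logits)
assert np.isclose(p.sum(), 1.0)            # a valid probability vector
```

The largest logit (index 2 here) receives the largest probability, so the softmax output agrees with the one-hot target in this example.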

The classification example in Section 3.4 trains a neural network to detect observations of QPSK symbols transmitted over the additive white Gaussian noise (AWGN) channel. The symbols are equiprobable, each with probability 1/D. Since the output of the softmax layer yields a probability distribution over the classes, it can be used for mutual information (MI) estimation:

I(X;Y) \geq \mathrm{E}\left[ \log_2 \frac{p_{X|Y}(x|y)}{p_X(x)} \right], \qquad (A.2a)

\approx \frac{1}{N} \sum_{n=1}^{N} \log_2\!\left( \frac{\sum_{i=1}^{D} \mathrm{softmax}_i(x^{(n)})\, t_i^{(n)}}{1/D} \right), \qquad (A.2b)



where N is the number of data samples and t^{(n)} is the target vector of the respective data sample. The sum in the numerator requires only one evaluation per sample, since all targets are one-hot encoded.
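A minimal sketch of the estimator in Eq. (A.2b), assuming D equiprobable classes and one-hot targets; the logit construction is an invented sanity check, not data from the thesis:

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with a max-shift for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mi_lower_bound(logits, targets):
    """MI lower bound in bits from N network outputs and one-hot targets,
    assuming D equiprobable classes with prior 1/D."""
    N, D = logits.shape
    p = softmax(logits)
    # One softmax entry per sample: the probability of the true class.
    p_true = np.sum(p * targets, axis=1)
    return np.mean(np.log2(p_true / (1.0 / D)))

# Toy check: confident, correct outputs give an estimate near log2(D).
rng = np.random.default_rng(0)
D = 4
targets = np.eye(D)[rng.integers(0, D, size=1000)]
logits = 10.0 * targets            # logits strongly favor the true class
mi = mi_lower_bound(logits, targets)
print(f"estimated MI: {mi:.3f} bits")   # close to log2(4) = 2
```

Conversely, uninformative (all-equal) logits drive the estimate to 0 bits, matching the lower-bound behavior of Eq. (A.2).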


Bibliography

[1] M. Castells et al., Information technology, globalization and social development. United Nations Research Institute for Social Development, Geneva, 1999, vol. 114.

[2] Cisco, “Cisco visual networking index: Forecast and methodology, 2016–2021,” Cisco white paper, 2017.

[3] P. J. Winzer, D. T. Neilson, and A. R. Chraplyvy, “Fiber-optic transmission and networking: the previous 20 and the next 20 years,” Optics Express, vol. 26, no. 18, pp. 24190–24239, 2018.

[4] K. Kao and G. A. Hockham, “Dielectric-fibre surface waveguides for optical frequencies,” in Proceedings of the Institution of Electrical Engineers, IET, vol. 113, 1966, pp. 1151–1158.

[5] T. Miya, Y. Terunuma, T. Hosaka, and T. Miyashita, “Ultimate low-loss single-mode fibre at 1.55 µm,” Electronics Letters, vol. 15, no. 4, pp. 106–108, 1979.

[6] E. Desurvire, Erbium-doped fiber amplifiers: principles and applications. Wiley, New York, 1994, vol. 19.

[7] J. Zyskind and A. Srivastava, Optically amplified WDM networks. Academic Press, 2011.

[8] E. Ip, A. P. T. Lau, D. J. Barros, and J. M. Kahn, “Coherent detection in optical fiber systems,” Optics Express, vol. 16, no. 2, pp. 753–791, 2008.

[9] S. J. Savory, “Digital coherent optical receivers: algorithms and subsystems,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 16, no. 5, pp. 1164–1179, 2010.



[10] S. L. Olsson, J. Cho, S. Chandrasekhar, X. Chen, E. C. Burrows, and P. J. Winzer, “Record-High 17.3-bit/s/Hz Spectral Efficiency Transmission over 50 km Using Probabilistically Shaped PDM 4096-QAM,” in 2018 Optical Fiber Communications Conference and Exposition (OFC), IEEE, 2018, pp. 1–3.

[11] P. J. Winzer and D. T. Neilson, “From scaling disparities to integrated parallelism: A decathlon for a decade,” Journal of Lightwave Technology, vol. 35, no. 5, pp. 1099–1115, 2017.

[12] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al., “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.

[13] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “TensorFlow: a system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.

[14] R. S. Tucker, “Green optical communications—part I: Energy limitations in transport,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 17, no. 2, pp. 245–260, 2011.

[15] G. P. Agrawal, Nonlinear Fiber Optics, 5th ed. Academic Press, 2012.

[16] R.-J. Essiambre, G. Kramer, P. J. Winzer, G. J. Foschini, and B. Goebel, “Capacity limits of optical fiber networks,” Journal of Lightwave Technology, vol. 28, no. 4, pp. 662–701, 2010.

[17] P. P. Mitra and J. B. Stark, “Nonlinear limits to the information capacity of optical fibre communications,” Nature, vol. 411, no. 6841, p. 1027, 2001.

[18] A. D. Ellis, J. Zhao, and D. Cotter, “Approaching the non-linear Shannon limit,” Journal of Lightwave Technology, vol. 28, no. 4, pp. 423–433, 2010.

[19] R. I. Killey and C. Behrens, “Shannon’s theory in nonlinear systems,” Journal of Modern Optics, vol. 58, no. 1, pp. 1–10, 2011.

[20] E. Agrell, M. Karlsson, A. Chraplyvy, D. J. Richardson, P. M. Krummrich, P. Winzer, K. Roberts, J. K. Fischer, S. J. Savory, B. J. Eggleton, et al., “Roadmap of optical communications,” Journal of Optics, vol. 18, no. 6, p. 063002, 2016.



[21] J. C. Cartledge, F. P. Guiomar, F. R. Kschischang, G. Liga, and M. P. Yankov, “Digital signal processing for fiber nonlinearities,” Optics Express, vol. 25, no. 3, pp. 1916–1936, 2017.

[22] G. P. Agrawal, Fiber-optic communication systems, 4th ed. John Wiley & Sons, 2012.

[23] P. Poggiolini, “The GN model of non-linear propagation in uncompensated coherent optical systems,” Journal of Lightwave Technology, vol. 30, no. 24, pp. 3857–3879, 2012.

[24] M. Secondini and E. Forestieri, “Analytical fiber-optic channel model in the presence of cross-phase modulation,” IEEE Photonics Technology Letters, vol. 24, no. 22, pp. 2016–2019, 2012.

[25] M. Secondini, E. Forestieri, and G. Prati, “Achievable information rate in nonlinear WDM fiber-optic systems with arbitrary modulation formats and dispersion maps,” Journal of Lightwave Technology, vol. 31, no. 23, pp. 3839–3852, 2013.

[26] P. Poggiolini, G. Bosco, A. Carena, V. Curri, Y. Jiang, and F. Forghieri, “The GN-model of fiber non-linear propagation and its applications,” Journal of Lightwave Technology, vol. 32, no. 4, pp. 694–721, 2014.

[27] R. Dar, M. Feder, A. Mecozzi, and M. Shtaif, “Properties of nonlinear noise in long, dispersion-uncompensated fiber links,” Optics Express, vol. 21, no. 22, pp. 25685–25699, 2013.

[28] R. Dar, M. Feder, A. Mecozzi, and M. Shtaif, “Accumulation of nonlinear interference noise in fiber-optic systems,” Optics Express, vol. 22, no. 12, pp. 14199–14211, 2014.

[29] R. Dar, M. Feder, A. Mecozzi, and M. Shtaif, “Inter-channel nonlinear interference noise in WDM systems: modeling and mitigation,” Journal of Lightwave Technology, vol. 33, no. 5, pp. 1044–1053, 2015.

[30] A. Carena, G. Bosco, V. Curri, Y. Jiang, P. Poggiolini, and F. Forghieri, “EGN model of non-linear fiber propagation,” Optics Express, vol. 22, no. 13, pp. 16335–16362, 2014.

[31] Z. Tao, L. Dou, W. Yan, L. Li, T. Hoshida, and J. C. Rasmussen, “Multiplier-free intrachannel nonlinearity compensating algorithm operating at symbol rate,” Journal of Lightwave Technology, vol. 29, no. 17, pp. 2570–2576, 2011.



[32] O. Golani, M. Feder, and M. Shtaif, “Kalman-MLSE Equalization for NLIN Mitigation,” Journal of Lightwave Technology, vol. 36, no. 12, pp. 2541–2550, 2018.

[33] K. V. Peddanarappagari and M. Brandt-Pearce, “Volterra series transfer function of single-mode fibers,” Journal of Lightwave Technology, vol. 15, no. 12, pp. 2232–2241, 1997.

[34] M. P. Yankov, F. Da Ros, E. P. da Silva, S. Forchhammer, K. J. Larsen, L. K. Oxenløwe, M. Galili, and D. Zibar, “Constellation shaping for WDM systems using 256QAM/1024QAM with probabilistic optimization,” Journal of Lightwave Technology, vol. 34, no. 22, pp. 5146–5156, 2016.

[35] T. Fehenberger, A. Alvarado, G. Böcherer, and N. Hanik, “On probabilistic shaping of quadrature amplitude modulation for the nonlinear fiber channel,” Journal of Lightwave Technology, vol. 34, no. 21, pp. 5063–5073, 2016.

[36] D. Richardson, J. Fini, and L. Nelson, “Space-division multiplexing in optical fibres,” Nature Photonics, vol. 7, no. 5, p. 354, 2013.

[37] P. J. Winzer, “Energy-efficient optical transport capacity scaling through spatial multiplexing,” IEEE Photonics Technology Letters, vol. 23, no. 13, pp. 851–853, 2011.

[38] B. Puttnam, R. Luís, W. Klaus, J. Sakaguchi, J.-M. D. Mendinueta, Y. Awaji, N. Wada, Y. Tamura, T. Hayashi, M. Hirano, et al., “2.15 Pb/s transmission using a 22 core homogeneous single-mode multi-core fiber and wideband optical comb,” in Optical Communication (ECOC), 2015 European Conference on, IEEE, 2015, pp. 1–3.

[39] G. Rademacher, R. S. Luís, B. J. Puttnam, T. A. Eriksson, E. Agrell, R. Maruyama, K. Aikawa, H. Furukawa, Y. Awaji, and N. Wada, “159 Tbit/s C+L Band Transmission over 1045 km 3-Mode Graded-Index Few-Mode Fiber,” in Optical Fiber Communication Conference, Optical Society of America, 2018, Th4C–4.

[40] R. Luís, G. Rademacher, B. Puttnam, T. Eriksson, H. Furukawa, A. Ross-Adams, S. Gross, M. Withford, N. Riesen, Y. Sasaki, K. Saitoh, K. Aikawa, Y. Awaji, and N. Wada, “1.2 Pb/s Transmission Over a 160 µm Cladding, 4-Core, 3-Mode Fiber, Using 368 C+L band PDM-256-QAM Channels,” in Optical Communication (ECOC), 2018 European Conference on, IEEE, 2018.



[41] J. Mata, I. de Miguel, R. J. Durán, N. Merayo, S. K. Singh, A. Jukan, and M. Chamania, “Artificial intelligence (AI) methods in optical networks: A comprehensive survey,” Optical Switching and Networking, 2018.

[42] F. Musumeci, C. Rottondi, A. Nag, I. Macaluso, D. Zibar, M. Ruffini, and M. Tornatore, “A Survey on Application of Machine Learning Techniques in Optical Networks,” arXiv preprint arXiv:1803.07976, 2018.

[43] J. Estaran, R. Rios-Müller, M. Mestre, F. Jorge, H. Mardoyan, A. Konczykowska, J.-Y. Dupuy, and S. Bigo, “Artificial neural networks for linear and non-linear impairment mitigation in high-baudrate IM/DD systems,” in ECOC 2016; 42nd European Conference on Optical Communication; Proceedings of, VDE, 2016, pp. 1–3.

[44] P. Li, L. Yi, L. Xue, and W. Hu, “100Gbps IM/DD Transmission over 25km SSMF using 20G-class DML and PIN Enabled by Machine Learning,” in Optical Fiber Communication Conference, Optical Society of America, 2018, W2A–46.

[45] C.-Y. Chuang, L.-C. Liu, C.-C. Wei, J.-J. Liu, L. Henrickson, W.-J. Huang, C.-L. Wang, Y.-K. Chen, and J. Chen, “Convolutional Neural Network based Nonlinear Classifier for 112-Gbps High Speed Optical Link,” in Optical Fiber Communication Conference, Optical Society of America, 2018, W2A–43.

[46] F. Lu, P.-C. Peng, S. Liu, M. Xu, S. Shen, and G.-K. Chang, “Integration of Multivariate Gaussian Mixture Model for Enhanced PAM-4 Decoding Employing Basis Expansion,” in Optical Fiber Communication Conference, Optical Society of America, 2018, M2F–1.

[47] D. Wang, M. Zhang, Z. Li, C. Song, M. Fu, J. Li, and X. Chen, “System impairment compensation in coherent optical communications by using a bio-inspired detector based on artificial neural network and genetic algorithm,” Optics Communications, vol. 399, pp. 1–12, 2017.

[48] T. S. R. Shen and A. P. T. Lau, “Fiber nonlinearity compensation using extreme learning machine for DSP-based coherent communication systems,” in Opto-Electronics and Communications Conference (OECC), 2011 16th, IEEE, 2011, pp. 816–817.



[49] D. Wang, M. Zhang, M. Fu, Z. Cai, Z. Li, H. Han, Y. Cui, and B. Luo, “Nonlinearity mitigation using a machine learning detector based on k-nearest neighbors,” IEEE Photonics Technology Letters, vol. 28, no. 19, pp. 2102–2105, 2016.

[50] D. Wang, M. Zhang, Z. Li, Y. Cui, J. Liu, Y. Yang, and H. Wang, “Nonlinear decision boundary created by a machine learning-based classifier to mitigate nonlinear phase noise,” in Optical Communication (ECOC), 2015 European Conference on, IEEE, 2015, pp. 1–3.

[51] X. Wu, J. A. Jargon, R. A. Skoog, L. Paraschis, and A. E. Willner, “Applications of artificial neural networks in optical performance monitoring,” Journal of Lightwave Technology, vol. 27, no. 16, pp. 3580–3589, 2009.

[52] T. B. Anderson, A. Kowalczyk, K. Clarke, S. D. Dods, D. Hewitt, and J. C. Li, “Multi impairment monitoring for optical networks,” Journal of Lightwave Technology, vol. 27, no. 16, pp. 3729–3736, 2009.

[53] J. A. Jargon, X. Wu, H. Y. Choi, Y. C. Chung, and A. E. Willner, “Optical performance monitoring of QPSK data channels by use of neural networks trained with parameters derived from asynchronous constellation diagrams,” Optics Express, vol. 18, no. 5, pp. 4931–4938, 2010.

[54] T. S. R. Shen, K. Meng, A. P. T. Lau, and Z. Y. Dong, “Optical performance monitoring using artificial neural network trained with asynchronous amplitude histograms,” IEEE Photonics Technology Letters, vol. 22, no. 22, pp. 1665–1667, 2010.

[55] F. N. Khan, T. S. R. Shen, Y. Zhou, A. P. T. Lau, and C. Lu, “Optical performance monitoring using artificial neural networks trained with empirical moments of asynchronously sampled signal amplitudes,” IEEE Photonics Technology Letters, vol. 24, no. 12, pp. 982–984, 2012.

[56] V. Ribeiro, L. Costa, M. Lima, and A. L. Teixeira, “Optical performance monitoring using the novel parametric asynchronous eye diagram,” Optics Express, vol. 20, no. 9, pp. 9851–9861, 2012.

[57] T. Tanimura, T. Hoshida, T. Kato, S. Watanabe, and H. Morikawa, “Data analytics based optical performance monitoring technique for optical transport networks,” in Optical Fiber Communication Conference, Optical Society of America, 2018, Tu3E–3.



[58] C. Häger and H. D. Pfister, “Nonlinear interference mitigation via deep neural networks,” in 2018 Optical Fiber Communications Conference and Exposition (OFC), IEEE, 2018, pp. 1–3.

[59] H. Wymeersch, N. V. Irukulapati, D. Marsella, P. Johannisson, E. Agrell, and M. Secondini, “On the use of factor graphs in optical communications,” in Optical Fiber Communication Conference, Optical Society of America, 2015, W4K–2.

[60] T. A. Eriksson, H. Bülow, and A. Leven, “Applying neural networks in optical communication systems: possible pitfalls,” IEEE Photonics Technology Letters, vol. 29, no. 23, pp. 2091–2094, 2017.

[61] K. Petermann, Laser diode modulation and noise. Springer Science & Business Media, 2012, vol. 3.

[62] T. Pfau, S. Hoffmann, and R. Noé, “Hardware-efficient coherent digital receiver concept with feedforward carrier recovery for M-QAM constellations,” Journal of Lightwave Technology, vol. 27, no. 8, pp. 989–999, 2009.

[63] J. G. Proakis, “Digital communications,” 1995.

[64] Z. H. Peric, I. B. Djordjevic, S. M. Bogosavljevic, and M. C. Stefanovic, “Design of signal constellations for Gaussian channel by using iterative polar quantization,” in Electrotechnical Conference, 1998. MELECON 98., 9th Mediterranean, IEEE, vol. 2, 1998, pp. 866–869.

[65] R. Schmogrow, S. Ben-Ezra, P. C. Schindler, B. Nebendahl, C. Koos, W. Freude, and J. Leuthold, “Pulse-shaping with digital, electrical, and optical filters—a comparison,” Journal of Lightwave Technology, vol. 31, no. 15, pp. 2570–2577, 2013.

[66] B. Châtelain, C. Laperle, K. Roberts, M. Chagnon, X. Xu, A. Borowiec, F. Gagnon, and D. V. Plant, “A family of Nyquist pulses for coherent optical communications,” Optics Express, vol. 20, no. 8, pp. 8397–8416, 2012.

[67] D. J. Geisler, N. K. Fontaine, R. P. Scott, J. P. Heritage, K. Okamoto, and S. B. Yoo, “360-Gb/s optical transmitter with arbitrary modulation format and dispersion precompensation,” IEEE Photonics Technology Letters, vol. 21, no. 7, pp. 489–491, 2009.



[68] Y. Gao, J. C. Cartledge, A. S. Karar, S. S.-H. Yam, M. O’Sullivan, C. Laperle, A. Borowiec, and K. Roberts, “Reducing the complexity of perturbation based nonlinearity pre-compensation using symmetric EDC and pulse shaping,” Optics Express, vol. 22, no. 2, pp. 1209–1219, 2014.

[69] A. Ghazisaeidi and R.-J. Essiambre, “Calculation of coefficients of perturbative nonlinear pre-compensation for Nyquist pulses,” in Optical Communication (ECOC), 2014 European Conference on, IEEE, 2014, pp. 1–3.

[70] C. Laperle and M. O’Sullivan, “Advances in high-speed DACs, ADCs, and DSP for optical coherent transceivers,” Journal of Lightwave Technology, vol. 32, no. 4, pp. 629–643, 2014.

[71] G. Khanna, B. Spinnler, S. Calabrò, E. De Man, and N. Hanik, “A robust adaptive pre-distortion method for optical communication transmitters,” IEEE Photonics Technology Letters, vol. 28, no. 7, pp. 752–755, 2016.

[72] C. Cahn, “Combined digital phase and amplitude modulation communication systems,” IRE Transactions on Communications Systems, vol. 8, no. 3, pp. 150–155, 1960.

[73] M. Seimetz, High-order modulation for optical fiber transmission. Springer, 2009, vol. 143.

[74] S. V. Manakov, “On the theory of two-dimensional stationary self-focusing of electromagnetic waves,” Soviet Physics-JETP, vol. 38, no. 2, pp. 248–253, 1974.

[75] O. V. Sinkin, R. Holzlohner, J. Zweck, and C. R. Menyuk, “Optimization of the split-step Fourier method in modeling optical-fiber communications systems,” Journal of Lightwave Technology, vol. 21, no. 1, pp. 61–68, 2003.

[76] Z. Tao, W. Yan, L. Liu, L. Li, S. Oda, T. Hoshida, and J. C. Rasmussen, “Simple fiber model for determination of XPM effects,” Journal of Lightwave Technology, vol. 29, no. 7, pp. 974–986, 2011.

[77] A. Splett, C. Kurtzke, and K. Petermann, “Ultimate Transmission Capacity of Amplified Optical Fiber Communication Systems taking into Account Fiber Nonlinearities,” 1993.



[78] A. Mecozzi and R.-J. Essiambre, “Nonlinear Shannon limit in pseudolinear coherent systems,” Journal of Lightwave Technology, vol. 30, no. 12, pp. 2011–2024, 2012.

[79] A. Chraplyvy, A. Gnauck, R. Tkach, and R. Derosier, “8×10 Gb/s transmission through 280 km of dispersion-managed fiber,” IEEE Photonics Technology Letters, vol. 5, no. 10, pp. 1233–1235, 1993.

[80] E. Ip and J. M. Kahn, “Digital equalization of chromatic dispersion and polarization mode dispersion,” Journal of Lightwave Technology, vol. 25, no. 8, pp. 2033–2043, 2007.

[81] S. J. Savory, “Digital filters for coherent optical receivers,” Optics Express, vol. 16, no. 2, pp. 804–817, 2008.

[82] F. Gardner, “A BPSK/QPSK timing-error detector for sampled receivers,” IEEE Transactions on Communications, vol. 34, no. 5, pp. 423–429, 1986.

[83] M. Yan, Z. Tao, L. Dou, L. Li, Y. Zhao, T. Hoshida, and J. C. Rasmussen, “Digital clock recovery algorithm for Nyquist signal,” in Optical Fiber Communication Conference, Optical Society of America, 2013, OTu2I–7.

[84] X. Zhou and C. Xie, Enabling technologies for high spectral-efficiency coherent optical communication networks. John Wiley & Sons, 2016.

[85] M. S. Faruk and K. Kikuchi, “Adaptive frequency-domain equalization in digital coherent optical receivers,” Optics Express, vol. 19, no. 13, pp. 12789–12798, 2011.

[86] J. H. Ke, K. P. Zhong, Y. Gao, J. C. Cartledge, A. S. Karar, and M. A. Rezania, “Linewidth-tolerant and low-complexity two-stage carrier phase estimation for dual-polarization 16-QAM coherent optical fiber communications,” Journal of Lightwave Technology, vol. 30, no. 24, pp. 3987–3992, 2012.

[87] M. Kuschnerov, F. N. Hauske, K. Piyawanno, B. Spinnler, M. S. Alfiad, A. Napoli, and B. Lankl, “DSP for coherent single-carrier receivers,” Journal of Lightwave Technology, vol. 27, no. 16, pp. 3614–3622, 2009.

[88] E. Ip and J. M. Kahn, “Compensation of dispersion and nonlinear impairments using digital backpropagation,” Journal of Lightwave Technology, vol. 26, no. 20, pp. 3416–3425, 2008.



[89] Y. Fan, L. Dou, Z. Tao, T. Hoshida, and J. C. Rasmussen, “A high performance nonlinear compensation algorithm with reduced complexity based on XPM model,” in Optical Fiber Communications Conference and Exhibition (OFC), 2014, IEEE, 2014, pp. 1–3.

[90] A. Napoli, Z. Maalej, V. A. Sleiffer, M. Kuschnerov, D. Rafique, E. Timmers, B. Spinnler, T. Rahman, L. D. Coelho, and N. Hanik, “Reduced complexity digital back-propagation methods for optical communication systems,” Journal of Lightwave Technology, vol. 32, no. 7, pp. 1351–1362, 2014.

[91] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012.

[92] T. A. Eriksson, T. Fehenberger, P. A. Andrekson, M. Karlsson, N. Hanik, and E. Agrell, “Impact of 4D channel distribution on the achievable rates in coherent optical communication experiments,” Journal of Lightwave Technology, vol. 34, no. 9, pp. 2256–2266, 2016.

[93] A. Alvarado, E. Agrell, D. Lavery, R. Maher, and P. Bayvel, “Replacing the soft-decision FEC limit paradigm in the design of optical communication systems,” Journal of Lightwave Technology, vol. 33, no. 20, pp. 4338–4352, 2015.

[94] S. K. Turitsyn, J. E. Prilepsky, S. T. Le, S. Wahls, L. L. Frumin, M. Kamalian, and S. A. Derevyanko, “Nonlinear Fourier transform for optical data processing and transmission: advances and perspectives,” Optica, vol. 4, no. 3, pp. 307–322, 2017.

[95] A. Shabat and V. Zakharov, “Exact theory of two-dimensional self-focusing and one-dimensional self-modulation of waves in nonlinear media,” Soviet Physics JETP, vol. 34, no. 1, p. 62, 1972.

[96] M. J. Ablowitz, D. J. Kaup, A. C. Newell, and H. Segur, “The inverse scattering transform-Fourier analysis for nonlinear problems,” Studies in Applied Mathematics, vol. 53, no. 4, pp. 249–315, 1974.

[97] M. I. Yousefi and F. R. Kschischang, “Information transmission using the nonlinear Fourier transform, Part I: Mathematical tools,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 4312–4328, 2014.



[98] Z. Dong, S. Hari, T. Gui, K. Zhong, M. Yousefi, C. Lu, P.-K. A. Wai, F. Kschischang, and A. Lau, “Nonlinear frequency division multiplexed transmissions based on NFT,” IEEE Photonics Technology Letters, vol. 99, p. 1, 2015.

[99] J. E. Prilepsky, S. A. Derevyanko, and S. K. Turitsyn, “Nonlinear spectral management: Linearization of the lossless fiber channel,” Optics Express, vol. 21, no. 20, pp. 24344–24367, 2013.

[100] S. T. Le, J. E. Prilepsky, and S. K. Turitsyn, “Nonlinear inverse synthesis for high spectral efficiency transmission in optical fibers,” Optics Express, vol. 22, no. 22, pp. 26720–26741, 2014.

[101] S. T. Le, I. D. Philips, J. E. Prilepsky, P. Harper, A. D. Ellis, and S. K. Turitsyn, “Demonstration of nonlinear inverse synthesis transmission over transoceanic distances,” Journal of Lightwave Technology, vol. 34, no. 10, pp. 2459–2466, 2016.

[102] T. Gui, C. Lu, A. P. T. Lau, and P. Wai, “High-order modulation on a single discrete eigenvalue for optical communications based on nonlinear Fourier transform,” Optics Express, vol. 25, no. 17, pp. 20286–20297, 2017.

[103] S. T. Le, V. Aref, and H. Buelow, “Nonlinear signal multiplexing for communication beyond the Kerr nonlinearity limit,” Nature Photonics, vol. 11, no. 9, p. 570, 2017.

[104] S. Gaiarin, A. M. Perego, E. P. da Silva, F. Da Ros, and D. Zibar, “Dual-polarization nonlinear Fourier transform-based optical communication system,” Optica, vol. 5, no. 3, pp. 263–270, 2018.

[105] T. Lakoba and D. Kaup, “Perturbation theory for the Manakov soliton and its applications to pulse propagation in randomly birefringent fibers,” Physical Review E, vol. 56, no. 5, p. 6147, 1997.

[106] J. Yang, “Multisoliton perturbation theory for the Manakov equations and its applications to nonlinear optics,” Physical Review E, vol. 59, no. 2, p. 2393, 1999.

[107] S. A. Derevyanko, J. E. Prilepsky, and D. A. Yakushev, “Statistics of a noise-driven Manakov soliton,” Journal of Physics A: Mathematical and General, vol. 39, no. 6, p. 1297, 2006.



[108] H. Terauchi, Y. Matsuda, A. Toyota, and A. Maruta, “Noise tolerance of eigenvalue modulated optical transmission system based on digital coherent technology,” in Optical Fibre Technology, 2014 Opto-Electronics and Communication Conference and Australian Conference on, IEEE, 2014, pp. 778–780.

[109] W. Q. Zhang, Q. Zhang, K. Amir, T. Gui, X. Zhang, A. P. T. Lau, C. Lu, T. Chan, and S. Afshar, “Noise correlation between eigenvalues in nonlinear frequency division multiplexing,” in Nonlinear Photonics, Optical Society of America, 2016, JM6A–13.

[110] T. Gui, T. H. Chan, C. Lu, A. P. T. Lau, and P.-K. A. Wai, “Alternative decoding methods for optical communications based on nonlinear Fourier transform,” Journal of Lightwave Technology, vol. 35, no. 9, pp. 1542–1550, 2017.

[111] J. Garcia and V. Aref, “Statistics of the Eigenvalues of a Noisy Multi-Soliton Pulse,” arXiv preprint arXiv:1808.03800, 2018.

[112] R. F. Fischer, Precoding and signal shaping for digital transmission. John Wiley & Sons, 2005.

[113] C. Pan and F. R. Kschischang, “Probabilistic 16-QAM shaping in WDM systems,” Journal of Lightwave Technology, vol. 34, no. 18, pp. 4285–4292, 2016.

[114] F. Buchali, F. Steiner, G. Böcherer, L. Schmalen, P. Schulte, and W. Idler, “Rate adaptation and reach increase by probabilistically shaped 64-QAM: An experimental demonstration,” Journal of Lightwave Technology, vol. 34, no. 7, pp. 1599–1609, 2016.

[115] J. Renner, T. Fehenberger, M. P. Yankov, F. Da Ros, S. Forchhammer, G. Böcherer, and N. Hanik, “Experimental comparison of probabilistic shaping methods for unrepeated fiber transmission,” Journal of Lightwave Technology, vol. 35, no. 22, pp. 4871–4879, 2017.

[116] R. Maher, K. Croussore, M. Lauermann, R. Going, X. Xu, and J. Rahn, “Constellation shaped 66 GBd DP-1024QAM transceiver with 400 km transmission over standard SMF,” in Optical Communication (ECOC), 2017 European Conference on, IEEE, 2017, pp. 1–3.

[117] I. B. Djordjevic, H. G. Batshon, L. Xu, and T. Wang, “Coded polarization-multiplexed iterative polar modulation (PM-IPM) for beyond 400 Gb/s serial optical transmission,” in Optical Fiber Communication Conference, Optical Society of America, 2010, OMK2.



[118] H. G. Batshon, I. B. Djordjevic, L. Xu, and T. Wang, “Iterative polar quantization-based modulation to achieve channel capacity in ultrahigh-speed optical communication systems,” IEEE Photonics Journal, vol. 2, no. 4, pp. 593–599, 2010.

[119] A. A. El-Rahman and J. Cartledge, “Multidimensional geometric shaping for QAM constellations,” in Optical Communication (ECOC), 2017 European Conference on, IEEE, 2017, pp. 1–3.

[120] T. Lotz, X. Liu, S. Chandrasekhar, P. Winzer, H. Haunstein, S. Randel, S. Corteselli, B. Zhu, and D. Peckham, “Coded PDM-OFDM transmission with shaped 256-iterative-polar-modulation achieving 11.15-b/s/Hz intrachannel spectral efficiency and 800-km reach,” Journal of Lightwave Technology, vol. 31, no. 4, pp. 538–545, 2013.

[121] F. Steiner and G. Böcherer, “Comparison of geometric and probabilistic shaping with application to ATSC 3.0,” in SCC 2017; 11th International ITG Conference on Systems, Communications and Coding; Proceedings of, VDE, 2017, pp. 1–6.

[122] Z. Qu and I. B. Djordjevic, “Geometrically shaped 16QAM outperforming probabilistically shaped 16QAM,” in Optical Communication (ECOC), 2017 European Conference on, IEEE, 2017, pp. 1–3.

[123] D. S. Millar, T. Fehenberger, T. Koike-Akino, K. Kojima, and K. Parsons, “Coded Modulation for Next-Generation Optical Communications,” in Optical Fiber Communication Conference, Optical Society of America, 2018, Tu3C–3.

[124] F. Jardel, T. A. Eriksson, C. Measson, A. Ghazisaeidi, F. Buchali, W. Idler, and J.-J. Boutros, “Exploring and Experimenting with Shaping Designs for Next-Generation Optical Communications,” Journal of Lightwave Technology, 2018.

[125] S. Zhang, F. Yaman, E. Mateo, T. Inoue, K. Nakamura, and Y. Inada, “Design and performance evaluation of a GMI-optimized 32QAM,” in Optical Communication (ECOC), 2017 European Conference on, IEEE, 2017, pp. 1–3.

[126] S. Zhang and F. Yaman, “Design and Comparison of Advanced Modulation Formats Based on Generalized Mutual Information,” Journal of Lightwave Technology, vol. 36, no. 2, pp. 416–423, 2018.


[127] B. Chen, C. Okonkwo, H. Hafermann, and A. Alvarado, “Increasing achievable information rates via geometric shaping,” arXiv preprint arXiv:1804.08850, 2018.

[128] B. Chen, C. Okonkwo, D. Lavery, and A. Alvarado, “Geometrically-shaped 64-point constellations via achievable information rates,” in 2018 20th International Conference on Transparent Optical Networks (ICTON), IEEE, 2018, pp. 1–4.

[129] A. Shiner, M. Reimer, A. Borowiec, S. O. Gharan, J. Gaudette, P. Mehta, D. Charlton, K. Roberts, and M. O’Sullivan, “Demonstration of an 8-dimensional modulation format with reduced inter-channel nonlinearities in a polarization multiplexed coherent system,” Optics Express, vol. 22, no. 17, pp. 20366–20374, 2014.

[130] O. Geller, R. Dar, M. Feder, and M. Shtaif, “A shaping algorithm for mitigating inter-channel nonlinear phase-noise in nonlinear fiber systems,” Journal of Lightwave Technology, vol. 34, no. 16, pp. 3884–3889, 2016.

[131] K. Kojima, T. Yoshida, T. Koike-Akino, D. S. Millar, K. Parsons, M. Pajovic, and V. Arlunno, “Nonlinearity-tolerant four-dimensional 2A8PSK family for 5–7 bits/symbol spectral efficiency,” Journal of Lightwave Technology, vol. 35, no. 8, pp. 1383–1391, 2017.

[132] E. Sillekens, D. Semrau, G. Liga, N. Shevchenko, Z. Li, A. Alvarado, P. Bayvel, R. Killey, and D. Lavery, “A simple nonlinearity-tailored probabilistic shaping distribution for square QAM,” in Optical Fiber Communication Conference, Optical Society of America, 2018, M3C.4.

[133] E. Sillekens, D. Semrau, D. Lavery, P. Bayvel, and R. I. Killey, “Experimental Demonstration of Geometrically-Shaped Constellations Tailored to the Nonlinear Fibre Channel,” in European Conf. Optical Communication (ECOC), 2018, Tu3G.3.

[134] M. P. Yankov, K. J. Larsen, and S. Forchhammer, “Temporal probabilistic shaping for mitigation of nonlinearities in optical fiber systems,” Journal of Lightwave Technology, vol. 35, no. 10, pp. 1803–1810, 2017.

[135] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.

[136] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.


[137] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[138] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 248–255.

[139] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.

[140] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[141] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[142] G. E. Dahl, T. N. Sainath, and G. E. Hinton, “Improving deep neural networks for LVCSR using rectified linear units and dropout,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, IEEE, 2013, pp. 8609–8613.

[143] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.

[144] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, p. 533, 1986.

[145] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, pp. 183–192, 1989.

[146] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in neural information processing systems, 1990, pp. 598–605.

[147] R. M. Neal, “Probabilistic inference using Markov chain Monte Carlo methods,” 1993.


[148] S. Haykin, Neural networks: a comprehensive foundation. Prentice Hall PTR, 1994.

[149] C. M. Bishop, Neural networks for pattern recognition. Oxford University Press, 1995.

[150] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[151] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural networks: Tricks of the trade, Springer, 2012, pp. 9–48.

[152] R. M. Neal, Bayesian learning for neural networks. Springer Science & Business Media, 2012, vol. 118.

[153] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, 1943.

[154] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, p. 386, 1958.

[155] M. Minsky and S. A. Papert, Perceptrons: An introduction to computational geometry. MIT Press, 2017.

[156] C. E. Rasmussen, “Gaussian processes in machine learning,” in Advanced lectures on machine learning, Springer, 2004, pp. 63–71.

[157] C. Robert, Machine learning, a probabilistic perspective, 2014.

[158] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, Cambridge, 2016, vol. 1.

[159] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS-W, 2017.

[160] Free courses and blog posts on machine learning and beyond, http://www.fast.ai/, Accessed: 2018-08-30.

[161] OpenAI blog, https://blog.openai.com/, Accessed: 2018-08-30.

[162] Machine learning blog by Google, https://medium.com/tensorflow, Accessed: 2018-08-30.


[163] Blog on machine learning topics with nice visualisations, http://colah.github.io/, Accessed: 2018-08-30.

[164] Blog by machine learning research scientist, https://www.inference.vc/, Accessed: 2018-08-30.

[165] Blog by machine learning research scientist, http://dustintran.com/blog/, Accessed: 2018-08-30.

[166] Blog by machine learning research scientist, http://www.cs.ox.ac.uk/people/yarin.gal/website/blog.html, Accessed: 2018-08-30.

[167] Machine learning podcast, http://www.thetalkingmachines.com/,Accessed: 2018-08-30.

[168] A. K. Jain and R. C. Dubes, “Algorithms for clustering data,” 1988.

[169] V. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[170] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.

[171] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.

[172] R. S. Sutton and A. G. Barto, Introduction to reinforcement learning. MIT Press, Cambridge, 1998, vol. 135.

[173] F. A. Aoudia and J. Hoydis, “End-to-End Learning of Communications Systems Without a Channel Model,” arXiv preprint arXiv:1804.02276, 2018.

[174] D. J. MacKay, Information theory, inference and learning algorithms. Cambridge University Press, 2003.

[175] H. Robbins and S. Monro, “A stochastic approximation method,” in Herbert Robbins Selected Papers, Springer, 1985, pp. 102–109.

[176] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.

[177] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in neural information processing systems, 2014, pp. 2933–2941.


[178] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” in Advances in neural information processing systems, 2011, pp. 2546–2554.

[179] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 2012.

[180] P. Baldi, “Autoencoders, unsupervised learning, and deep architectures,” in Proceedings of ICML workshop on unsupervised and transfer learning, 2012, pp. 37–49.

[181] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, 2017.

[182] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” 2018.

[183] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al., Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

[184] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 315–323.

[185] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning, 2013, pp. 1310–1318.

[186] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural Networks, vol. 12, no. 1, pp. 145–151, 1999.

[187] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[188] M. Bartholomew-Biggs, S. Brown, B. Christianson, and L. Dixon, “Automatic differentiation of algorithms,” Journal of Computational and Applied Mathematics, vol. 124, no. 1–2, pp. 171–190, 2000.

[189] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, “Automatic differentiation in machine learning: a survey,” Journal of Machine Learning Research, vol. 18, no. 153, pp. 1–43, 2017.


[190] O. Golani, R. Dar, M. Feder, A. Mecozzi, and M. Shtaif, “Modeling the bit-error-rate performance of nonlinear fiber-optic systems,” Journal of Lightwave Technology, vol. 34, no. 15, pp. 3482–3489, 2016.

[191] O. Golani, M. Feder, A. Mecozzi, and M. Shtaif, “Correlations and phase noise in NLIN-modelling and system implications,” in Optical Fiber Communication Conference, Optical Society of America, 2016, W3I.2.

[192] R. T. Jones, Source code, https://github.com/rassibassi/claude, 2018.

[193] S. Li, C. Häger, N. Garcia, and H. Wymeersch, “Achievable Information Rates for Nonlinear Fiber Communication via End-to-end Autoencoder Learning,” in European Conf. Optical Communication (ECOC), 2018, We1F.3.

[194] B. P. Karanov, M. Chagnon, F. Thouin, T. A. Eriksson, H. Buelow, D. Lavery, P. Bayvel, and L. Schmalen, “End-to-end Deep Learning of Optical Fiber Communications,” Journal of Lightwave Technology, pp. 1–1, 2018.

[195] T. J. O’Shea, T. Roy, N. West, and B. C. Hilburn, “Physical Layer Communications System Design Over-the-Air Using Adversarial Networks,” arXiv preprint arXiv:1803.03145, 2018.

[196] D. S. Millar, R. Maher, D. Lavery, T. Koike-Akino, M. Pajovic, A. Alvarado, M. Paskov, K. Kojima, K. Parsons, B. C. Thomsen, S. J. Savory, and P. Bayvel, “Design of a 1 Tb/s superchannel coherent receiver,” Journal of Lightwave Technology, vol. 34, no. 6, pp. 1453–1463, 2016.

[197] M. P. Yankov, E. P. da Silva, F. Da Ros, and D. Zibar, “Experimental analysis of pilot-based equalization for probabilistically shaped WDM systems with 256QAM/1024QAM,” in Optical Fiber Communications Conference and Exhibition (OFC), 2017, IEEE, 2017, pp. 1–3.

[198] D. Angluin, “Queries and concept learning,” Machine Learning, vol. 2, no. 4, pp. 319–342, 1988.

[199] N. Abe, H. Mamitsuka, et al., “Query learning strategies using boosting and bagging,” in Machine learning: proceedings of the fifteenth international conference (ICML’98), Morgan Kaufmann Pub, vol. 1, 1998.


[200] T. J. O’Shea, L. Pemula, D. Batra, and T. C. Clancy, “Radio transformer networks: Attention models for learning to synchronize in wireless systems,” in Signals, Systems and Computers, 2016 50th Asilomar Conference on, IEEE, 2016, pp. 662–666.

[201] E. Agrell, A. Alvarado, G. Durisi, and M. Karlsson, “Capacity of a nonlinear optical channel with finite memory,” Journal of Lightwave Technology, vol. 32, no. 16, pp. 2862–2876, 2014.