Background and Motivation

Why Should We Care About VAEs for Music?

What is useful about Variational Autoencoders (VAEs) for the audio and music domain? Variational Autoencoders (VAEs) are a type of generative model which learn to map data into a lower-dimensional, structured latent space. They consist of two key components:

Encoder: Transforms input data into a distribution in the latent space, modeled as a multivariate Gaussian distribution

Decoder: Reconstructs the data from values in the latent distribution

A graph of VAE architecture — Standard VAE Architecture

VAEs use a regularization term, the Kullback-Leibler (KL) divergence, to encourage the latent space to approximate a prior, the standard normal distribution. This enourages smoothness and continuity in the latent space, where similar points correspond to similar data. Unlike traditional autoencoders, where the latent space lacks an explicit probabilistic organization, VAEs are able to generate entirely new data points by sampling from the latent space rather than directly reconstructing a fixed encoding. VAEs have been extensively studied for the image domain but slightly less so for the audio domain which comes with its own challenges. The most useful way to represent audio for the purposes of human analysis is as a spectrogram which displays time on one axis, frequency on the other, and each point takes on a value corresponding to the amplitude of that frequency at that time. Notably, unlike images, the two axis represent completely different features of the audio. Additionally, there's only translation invariance in the time dimension. This presents challenges that to not are not apparent in the image domain.

The property of being able to sample new points from a latent distribution makes VAEs particular salient for the tasks of music generation. VAEs excel in learning smooth latent representations, which aid in producing meaningful representations of important audio attributes for music like pitch, rythm, etc. This means that when reconstructing audio from the latent space, an artist using such a VAE can in theory pick and choose attributed that they like or would like to tweak with the decoded sound simply by moving slightly in the latent space in the dimension that controls the desired features.

Due to this motivation, we seek to compare the capability of different encoder/decoder architecture of VAEs for the task of music audio generation. We explore variations on standard VAE models including adapting kernel sizes in order to fit the audio domain better, implementing residual blocks into the architecture, as well as variations on loss functions. We also explore the capabilities of VAEs for transfer learning in the audio domain, which as we've discussed, is inherently different from that of the image of domain.

Related Works

Different VAE architectures have been proposed to optimize latent representation quality. RAVE, proposed by Caillon et al, finds success in high-quality audio generation using VAEs mixed with GANs [1].

Transfer learning with VAEs has been studied extensively in the image domain, but its application to audio remains underexplored. Research such as "How Good Are VAEs at Transfer Learning?" the representational similarity of VAE components (encoders and decoders) across different datasets using Centred Kernel Alignment (CKA) [2]. It concludes that encoders produce generic representations suitable for transfer, while decoders are more task-specific. This offers guidance on the roles of both VAE components and methods for effective transfer learning, although this does not focus on the audio domain. Similarly, Hsu et al's work on "Learning Latent Representations for Speech Generation and Transformation" introduces 1D kernels to distinguish between time and frequency dimensions [3]. This offers a foundation for applying VAEs to audio, but the implications for cross-domain audio tasks are not fully explored.

Strong techniques for improving VAE performance have been proposed, such as β-annealing, used by Sankarapandian et al. in "β-Annealed Variational Autoencoder for glitches" [4]. β-annealing involves progressively increasing the β-term during training to balance reconstruction quality and KL divergence. The β-term wieghs KL divergence to encourage a normal latent space distribution. They find that this annealing schedule mimics the effects of gradually increasing information capacity in Bottleneck-VAEs. While this approach has been applied to spectrogram representations for gravitational wave detection, its application to music spectrograms remains unexplored. Additionally, using 1D kernels to separate time and frequency dimensions has shown promise in speech processing but has not been fully investigated for musical audio generation.

The current literature on VAE presents a clear gap in a definitive comparison of the quality of different encoder and decoder architectures for the purposes of generation. This type of generation comparison in the audio domain remains even more under explored. Of the VAEs used to generate audio, few have been used for music. Those that focus on music generation such as Jukebox[5] or RAVE[1] typically use more complicated models and do not focus on the comparison of different encoder and decoder architectures; or they do not focus on the raw wave representation of the data, such as MusicVAE, which uses MIDI inputs instead [6]. Additionally, many other advanced techniques have only properly been explored in the image domain. This includes using residual blocks in the encoder and decoder architecture, beta-annealing for the loss, as well as other applications of VAEs which we investigate such as the potential for transfer learning and the generality of learned latent spaces.

Data and preprocessing

For this project, we used audio data from the Medley-solos-DB, a database containing 21,572 mono audio clips sampled at 44.1 kHz with a 32-bit depth [7]. Each audio clip has a fixed duration of 2972 milliseconds, or 65536 discrete-time samples, providing a uniform and manageable input size for our model. These clips are split by instrument, allowing us to isolate individual instruments for training and generation tasks.

To prepare the data for training, we performed several preprocessing steps. The audio data was first converted into spectrograms, which provide a time-frequency representation of the audio. This transformation is crucial for working with audio in machine learning, as raw waveform data can be difficult for models to process effectively.

A spectrogram is essentially a visual representation of the frequencies that are present in an audio signal over time. It is computed by applying a Fourier transform to the audio signal, breaking it down into its frequency components across small windows of time. The Fourier transform itself is a mathematical operation that converts a time-domain signal (which shows how the signal varies with time) into a frequency-domain signal (which shows how much of each frequency is present at any given time).

To create the spectrograms, we used the following parameters:

n_mels = 512: This sets the number of Mel-frequency bins, which are a logarithmic scale that mimics the human ear's sensitivity to sound. The Mel scale is designed to capture the perceptually significant frequencies, and 512 bins offer a good balance between frequency resolution and computational efficiency.
n_fft = 4096 : This is the size of the FFT window. A larger window size results in higher frequency resolution, which is useful for capturing fine details in the frequency domain.
hop_len = 1024: The hop length is the number of samples between successive frames. A smaller hop length allows for finer temporal resolution, which is important for capturing the dynamics of music.
sr = 22050: The sample rate of the spectrogram was set to 22,050 Hz, which is commonly used in music-related tasks for balancing time and frequency resolution.

Using these settings, we generated Mel spectrograms with the dimensions (512, 64), where 512 corresponds to the number of frequency bins (on the Mel scale) and 64 represents the number of time frames, effectively reducing the dimensionality of the raw audio while retaining crucial frequency and temporal information. These spectrograms were then used as input data for training our Variational Autoencoder (VAE) model. The goal of this preprocessing was to create spectrograms that are small in dimension but informative enough for robust reconstruction during the VAE training process. This was important for ensuring that the latent space of the VAE could capture the musical features needed to generate high-quality audio outputs.

Figure 1: An example Mel spectorgram taken from the piano training data
(NOTE: Click image to play audio)

Model Architechtures and Loss Variations

We explore four main kinds of encoder/decoder architectures. We found early on that for our purposes, a small latent dimension performed best in being able to best generate new data as well still have sufficient reconstruction quality over various different models. Therefore, we trained all models using a latent dimension of 8. Simiarly, we used a batch size of 64 which allowed us to speed up training slightly by better utilizing gpu while still ensuring that we're not averaging the gradient step over too many samples and smoothing too much. The first encoder/decoder architecture that we considered was a fully linear model, which flattens the input image, then has a series of linear layers, each decreasing in width, until it reaches the desired latent dimension. The decoder simply has the same linear layers in the reverse direction. After each linear layer, we use the ReLU nonlinearity. Throughout the result of the results section, we'll refer to this as the linear model, the full architecture of which is described in Appendix A.

We then consider a CNN-based encoder and decoder which contains a series of convolutional layers, downsizing the image while also increasing in the number of channels. The result is then flattened and linear layers are applied in order to reach the desired latent dimension. In the decoder, we use deconvolutional layers in order to upsize the image and reduce the number of channels. After each (de)convolutional layer, we apply a BatchNorm layer as well as a ReLU nonlinearity. We'll refer to this first, more basic convolutional model as the conv model and the complete architecture can be found in Appendix B.

The next architecture that we tested is a more audio-domain-aware CNN approach. In particular, drawing inspiration from Hsu W et al [3], we adapt the more fundemental convolutional architecture above to be more fitted for the audio domain by downsampling more gradually and using 1D kernels with varying sizes. Intuitively, the two dimensions of a spectrogram, time and frequency, represent very different things which is what separates the audio domain from the image domain. As such, it makes sense to treat them differently by allowing them to largely have seperate kernels. Additionally, we implement larger kernel sizes early in the encoder and deep in the decoder in the hopes of being able to better capture finer details in the frequency and time dimensions such as more accurate timbre representation as well as quick note changes in the audio. Like the previous model, BatchNorm is applied after each (de)convolutinal layer and ReLU applied after that. Due to these awarenesses of the audio domain, we refer to this model as the ADA conv model or audio domain aware conv model. The full architecture can be found in Appendix C.

The last architecture that we implemented was one that incorporates residual blocks into both the encoder and decoder, inspired by the work of Vahdat A. and Kautz J. on incorporating residual blocks into the decoder of a VAE trained on the CelebA HQ dataset [8]. Similar to the ADA conv architecture, the hope is that by incorporating skip connections via residual blocks, we're able to better retain finer information from early convolutional layers that should be capturing details such as finer timbre information and quick note changes in the frequency and time dimensions respectively. Therefore, for the encoder, we implement early residual layers in order to capture this information. For the decoder, the residual blocks are more evenly spread throughout the layers or order to retain both salient features learned in the latent space as well as being able to successfully back out and decode more of the intermediate features and fine grained that the model learns. We refer to this model as the resid model and the complete architecture can be found in Appendix D.

In order to train our models, we used a technique called beta-annealing. The traditional VAE loss is given by the reconstruction loss + KL divergence term. However, there's a variation known as beta-VAE which slightly modifies the loss to be the reconstruction term + beta * KL divergence term. What beta-annealing does is go one step further and vary beta throughout the training process. In particular, beta starts low and is increased over time. This ensures that the first priority of the model is accurate reconstruction. This is particularly important for a task such as audio generation since if the model cannot recreate quality audio on its training data, it will have no hope of being able to generate quality sounding data by sampling from the latent space.

Therefore, we chose to first linearly increase beta from 0.05 at the start to 0.20 over the first 150 epochs. This encourages the model to learn accurate reconstructions while still restricting the latent space to being approximately Gaussian. We then keep beta constant at 0.20 until epoch 1000 which ensures that the model has achieved near-optimal loss for these hyperparameters. Then, we increase beta by 0.20 every 25 epochs for the next 450 epochs. The nonlinear jumps in beta were selected as we noticed that the KL divergence term very quickly adapts to the new value of beta and until the next jump, does not decrease much over the subsequent epochs. We choose to hold each beta for 25 epochs as we found that that is approximately how long it took the reconstruction loss to heavily plateau after the change in beta. Lastly, we chose to increase beta nine times (over 450 epochs) as this was approximately before the point where we observed a drop in quality of the audio reconstruction. The full training scheme as well as loss terms can be observed in Figure 2 below. By training according to this beta-annealing scheme, we're able to ensure the latent space is as close to Gaussian as possible (which encourages generalization and quality sampling) without sacrificing subjective audio reconstruction quality.

Description of the image — **Figure 2:** Training of Piano Conv Model with beta-annealing, losses plotted left, beta on right

Experimental Results

In order to compare the performance of the architectures, we focused primarily on models trained on the piano dataset since that's what we had the most data for. Let's first look at (and listen to!) the quality of the reconstruction of a few of the training data points in Figure 3 below.

Figure 3: Comparison of reconstructed outputs from each model
(NOTE: Click image to play audio)

We observe that for the first two samples, all four of our models reconstruct the original audio quite accurately. This is expected since we're prioritizing reconstruction accuracy in our beta-annealing scheme. For the third sample, the true audio is a bit more complex and in particular, involves quick note changes. For this sample, our subjective ranking for audio reconstruction quality is (from best to worst) the ADA conv model, linear model, and then the resid model and conv model in some order. It's surprising to us that the residual model does not perform as well given that in theory, it should be able to capture even more intricacies than the ADA conv model. Notably, both the residual model as well as the convolutional model fail to capture the quick note changes present in the true audio.

Now, let's examine the quality of the reconstructed audio of each of these models. We do this by sampling points in latent space accoriding to a Gaussian distribution in the latent dimension. The results of these decoded samples are visualized and heard in Figure 4 below.

Figure 4: Comparison of outputs generated from random normal samples of the latent space, for 4 models
(NOTE: Click image to play audio)

Notice that the linear model, despite doing well in reconstruction, has subjectively the worst quality audio reconstruction. In particular, the model appears to be unable to accurately capture the timbre of a piano although it is able to get reasonable sounding rhythmic qualities. This implies that the model was overfitting to the training data. This is expected as a linear model is unable to learn the local structure of the spectrograms and how the different frequencies interact with one another in ways such as learning proper harmonics. Next in quality, we find the the more basic convolutional model. Like the linear model, this model appears to accurately reconstruct temporal features but is lacking in quality harmonically and its reconstructed audios are not convincingly piano (although it does perform better than the linear model in this respect).

The ADA conv model and the residual model performed similarly in the quality of the audio reconstructions and had strengths in slightly different areas. As we observed in the reconstruction, the residual model does not excell in producing audio with quick note changes with distinct temporal starts and often blends such sounds together. That being said, as can be observed in the first sample for the model, it does produce features that appear to runs on piano, granted, these do sound very blended. These intesting features were not observed in other samples from the ADA conv model. The ADA conv model, on the other hand, is capable of generating audio with faster note changes, as can be observed in its second sample. Such samples were much more abundant than in the residual model. By listening to a plethora of samples from both models, we opted to proceed with the ADA conv model for further testing although both models excelled in different ways for the task of audio generation.

Now that we have a model to continue with, let's explore some of the properties of it. Arguably the most important and most interesting property of a VAE is it's latent space. First, we'll encode all of the training data and plot each pair of latent dimensions against one another as shown below in Figure 5.

Figure 5: Pairwise plot of the encodings of the training data across all latent dimensions

We observe that the result is unit Gaussian which is expected since we incur loss for how far away this latent space is from unit Gaussian. Arguably the most interesting aspect of VAEs, and their particular interest for music generation, is the ability to deliberately move in latent space in order to select for certain features. Below in Figure 6, we provide an interactive graphic to select any two latent dimensions and both visualize and hear how the aspects of the decoded sound change across latent dimensions. For dimensions that are not selected, their encoded values are set to 0.

Figure 6: Latent Dimension Analysis for Piano samples, using the ADA Conv architecture
(NOTE: Click image to play audio)

Lastly for the evaluation of the quality of reconstructed audio for these architectures, we'll take a look at how the ADA conv architecture performs on other instruments. In particular, we also train a model with this architecture for both flute and violin since these are the other instruments that we have significant data for. We sample the latent space and decode the same random vector using each model. The results are below in Figure 7.

Figure 7: Comparison of outputs generated from random normal samples of the latent space for different instruments
(NOTE: Click image to play audio)

Listening to the reconstructed audios, we subjectively determien that the piano model performs the best. The flute and violin model both perform similarly and have successes and faults in similar areas. In particular, when there is a consistent tone or very slow moving notes, both models seems to accurately reconstruct the timbre of their respective instruments well. However, for samples where there is more apparent movement or more complex sounds, the models struggle to resemble their instruments harmonically. As a notable aside which brings us into the next section, note that despite decoding the same point in latent space, the reconstructed audios sound very differnt from each other. This leads us to question whether these models still learn similar or compatible encodings in some sense. If so, then it would be possible to do transfer learning with these models.

Transfer Learning in Audio

Clearly as the models currently stand, the same encoded point in latent space maps to different features in the reconstructed spectrogram space. Intuitively, this makes sense as we wouldn't expect the models to necessarily encode the same features of the spectrogram into the same latent dimensions or even necessarily to learn the same features at all. However, if the latent spaces are similar in some sense then it should be possible to train a general purpose encoder to work for a larger set of audio and apply it to more domain specific tasks such as individual instruments or potentially even on larger classes such as genre. This would enable less training since only one half of the model (the decoder) would need to be trained.

In order to test this, we passed piano audio through the flute encoder and decoded with the piano decoder, training a linear neural network to interpolate between the flute's latent space and the piano's latent space. Intuitively, we expect different domains of audio to share similar qualities which may be embedded in the latent space (in the domain of instrument audio this is rythm, pitch, etc), so all that is needed to transfer between domains is translation between latent spaces.

In particular, we took the ADA conv model for both the piano and the flute and constructed a transfer VAE which, on an input, passes it through the flute's encoder, then a small (two hidden layers of width 128) neural net, then the piano's decoder. In training, we froze the parameters of the encoder and decoder so that it was only necessarily to train the neural net in the latent dimensions transfer layer, significantly reducing time. For training, since we're using the piano decoder, we train on piano audio and only consider the MSE loss between the original spectrogram and the reconstructed spectrogram since the assumption is that the latent space that the flute encoder maps to is already approximately Gaussian. In Figure 8 below, we display the results of this experiment, providing the original audio, the audio from encoding with both a piano encoder and decoder, as well as the reconstructed audio from our transfer model which uses a flute encoder, small neural net, and piano decoder.

Figure 8: Comparison of the original audio compared to its encoding and decoding with a piano only model as well as our transfer model
(NOTE: Click image to play audio)

We notice that the transfer model does in fact roughly reconstruct the original audio although it is not perfect. In particular, it seems to capture timbrally what the audio sounds like but struggles a bit in the time dimension. In the first reconstruction, the chord hits are misaligned over the frequencies and in the second reconstruction, the reconstruction does not accurately capture the quick note changes. Despite this, the resulting audios are still recognizable as the originals. This suggest that different domains which share similar overall properties (such as pitch or chords in the instrument audio domain) have latent spaces which are able to translate global properties, providing hope for the possibility of using VAEs for transfer learning.

Implications, Limitations, and Future Work

Through the evauluation of these audio music domains across different architectures, we demonstrate how domain-aware modifications—such as 1D kernels and residual blocks—can enhance model performance for capturing the nuanced properties of audio data. The successful translation of latent spaces between instruments further highlights VAEs' promise in reducing retraining efforts for cross-domain tasks if sufficient similarities can be found.

However, our experiments also revealed several limitations. While the transfer model was able to approximate reconstructions across domains, it still struggled with temporal precision in complex audio patterns such as those with quick note changes.

Another challenge lies in understanding and quantifying the similarities between latent spaces across domains. Although our approach of using a small neural network to map between latent spaces showed promise, it requires further validation and generalization across diverse audio datasets and tasks.

Although we were constrained by resources (both in compute and overall time), our results show potential. However, the computational cost of training these very domain-specific models presents a scalability concern, especially given the relatively large depth of some architectures we used. Beyond this, it may be diffucult to train models as the number of epochs needed to train effectively, the size of data available, and the size of each sample (roughly 2 second) scale.

Future Work

While our finding show promise, future research should explore hybrid architectures that integrate convolutional layers for initial feature extraction with residual blocks for deeper, more nuanced processing. Hybrid designs like this may have the potential to combine the strengths of temporal precision from convolutional models and the harmonic accuracy seen using residual models.

Beyond this, improving the temporal accuracy of the residual models may be an area for advancement. This could include fine-tuning the placement and configuration of residual connections, as well as integrating attention mechanisms to better capture rapid note transitions and detailed temporal dependencies in audio data.

Expanding the scope of transfer learning research to include more diverse audio datasets and domains is another priority. Formalizing metrics to quantify latent space similarities and ensuring consistent representations across models trained on varied data will be crucial. More work may be done on efficiently translating between these latent spaces.

Lastly, addressing computational limitations is essential for scaling these approaches. We may attempt to reduce the size of these models to reduce training overhead without limiting performance.

Appendix

Linear Model Architecture for Encoder and Decoder
Layer Type	Layer Name	Input Dimension	Output Dimension	Notes
Encoder	Flatten	(512x64)	32768	Flatten spectrogram
Encoder	Linear1	32768	4096
Encoder	Linear2	4096	1024
Encoder	Linear3	1024	256
Bottleneck	fc_mu	256	8	Encode mu
Bottleneck	fc_logvar	256	8	Encode logvar
Decoder	Linear1	8	256
Decoder	Linear2	256	1024
Decoder	Linear3	1024	4096
Decoder	Linear1	4096	32768
Decoder	Unflatten	32768	(512x64)	Final reconstructed output

Convolutional Model Architecture for Encoder and Decoder
Layer Type	Layer Name	Input Dimension	Output Dimension	Notes
Encoder	Conv1	(1x512x64)	(16x512x64)	kernel_size=(3,3), stride=(1,1)
Encoder	Conv2	(16x512x64)	(32x128x32)	kernel_size=(4,4), stride=(4,2)
Encoder	Conv3	(32x128x32)	(64x32x16)	kernel_size=(4,4), stride=(4,2)
Encoder	Conv4	(64x32x16)	(128x16x8)	kernel_size=(3,3), stride=(2,2)
Encoder	Conv5	(128x16x8)	(128x8x8)	kernel_size=(3,1), stride=(2,1)
Encoder	Flatten	(128x8x8)	8196	Flatten
Encoder	Linear1	8196	1024
Encoder	Linear2	1024	256
Bottleneck	fc_mu	256	8	Encode mu
Bottleneck	fc_logvar	256	8	Encode logvar
Decoder	Linear1	8	256
Decoder	Linear2	256	1024
Decoder	Linear3	1024	8196
Decoder	Unflatten	8196	(128x8x8)	Unflatten
Decoder	Deconv1	(128x8x8)	(128x16x8)	kernel_size=(3,1), stride=(2,1)
Decoder	Deconv2	(128x16x8)	(64x32x16)	kernel_size=(3,3), stride=(2,2)
Decoder	Deconv3	(64x32x16)	(32x128x32)	kernel_size=(5,5), stride=(4,2)
Decoder	Deconv4	(32x128x32)	(16x512x64)	kernel_size=(5,5), stride=(4,2)
Decoder	Deconv5	(16x512x64)	(1x512x64)	kernel_size=(1,1), stride=(1,1)

Audio Domain Aware Convolutional Model Architecture for Encoder and Decoder
Layer Type	Layer Name	Input Dimension	Output Dimension	Notes
Encoder	Conv1	(1x512x64)	(16x512x64)	kernel_size=(3,3), stride=(1,1)
Encoder	Conv2	(16x512x64)	(32x256x64)	kernel_size=(5,1), stride=(2,1)
Encoder	Conv3	(32x256x64)	(64x256x32)	kernel_size=(1,5), stride=(1,2)
Encoder	Conv4	(64x256x32)	(128x128x32)	kernel_size=(3,1), stride=(2,1)
Encoder	Conv5	(128x128x32)	(128x64x32)	kernel_size=(3,1), stride=(2,1)
Encoder	Conv6	(128x64x32)	(128x64x16)	kernel_size=(1,3), stride=(1,2)
Encoder	Conv7	(128x64x16)	(128x32x16)	kernel_size=(3,1), stride=(2,1)
Encoder	Conv8	(128x32x16)	(128x16x16)	kernel_size=(3,1), stride=(2,1)
Encoder	Flatten	(128x16x16)	32768	Flatten
Encoder	Linear1	32768	2048
Encoder	Linear2	2048	256
Bottleneck	fc_mu	256	8	Encode mu
Bottleneck	fc_logvar	256	8	Encode logvar
Decoder	Linear1	8	256
Decoder	Linear2	256	2048
Decoder	Linear3	2048	32176
Decoder	Unflatten	32176	(128x16x16)	Unflatten
Decoder	Deconv1	(128x16x16)	(128x32x16)	kernel_size=(3,1), stride=(2,1)
Decoder	Deconv2	(128x32x16)	(128x64x16)	kernel_size=(3,1), stride=(2,1)
Decoder	Deconv3	(128x64x16)	(128x64x32)	kernel_size=(1,3), stride=(1,2)
Decoder	Deconv4	(128x64x32)	(128x128x32)	kernel_size=(3,1), stride=(2,1)
Decoder	Deconv5	(128x128x32)	(64x256x32)	kernel_size=(3,1), stride=(2,1)
Decoder	Deconv6	(64x256x32)	(32x256x64)	kernel_size=(1,5), stride=(1,2)
Decoder	Deconv7	(32x256x64)	(16x512x64)	kernel_size=(5,1), stride=(2,1)
Decoder	Deconv8	(16x512x64)	(1x512x64)	kernel_size=(3,3), stride=(1,1)

Audio Domain Aware Residual Model Architecture for Encoder and Decoder
Layer Type	Layer Name	Input Dimension	Output Dimension	Notes
Encoder	Resid1	(1x512x64)	(16x256x32)	kernel_size=(3,3), stride=(2,2)
Encoder	Resid2	(16x256x32)	(32x128x32)	kernel_size=(5,1), stride=(2,1)
Encoder	Resid3	(32x128x32)	(64x128x16)	kernel_size=(1,5), stride=(1,2)
Encoder	Resid4	(64x128x16)	(128x64x16)	kernel_size=(3,1), stride=(2,1)
Encoder	Conv1	(128x64x16)	(128x64x8)	kernel_size=(1,3), stride=(1,2)
Encoder	Conv2	(128x64x8)	(128x32x8)	kernel_size=(3,1), stride=(2,1)
Encoder	Conv3	(128x32x8)	(128x16x8)	kernel_size=(3,1), stride=(2,1)
Encoder	Flatten	(128x16x8)	16384	Flatten
Encoder	Linear1	16384	2048
Encoder	Linear2	2048	256
Bottleneck	fc_mu	256	8	Encode mu
Bottleneck	fc_logvar	256	8	Encode logvar
Decoder	Linear1	8	256
Decoder	Linear2	256	2048
Decoder	Linear3	2048	16384
Decoder	Unflatten	16384	(128x16x8)	Unflatten
Decoder	Resid1	(128x16x8)	(128x16x8)	kernel_size=(3,3), stride=(1,1)
Decoder	Deconv1	(128x16x8)	(128x32x8)	kernel_size=(3,1), stride=(2,1)
Decoder	Deconv2	(128x32x8)	(128x64x8)	kernel_size=(3,1), stride=(2,1)
Decoder	Deconv3	(128x64x8)	(128x64x16)	kernel_size=(1,3), stride=(1,2)
Decoder	Resid2	(128x64x16)	(128x64x16)	kernel_size=(3,3), stride=(1,1)
Decoder	Deconv4	(128x64x16)	(64x128x16)	kernel_size=(3,1), stride=(2,1)
Decoder	Deconv5	(64x128x16)	(32x128x32)	kernel_size=(1,5), stride=(1,2)
Decoder	Resid3	(32x128x32)	(32x128x32)	kernel_size=(3,3), stride=(1,1)
Decoder	Deconv6	(32x128x32)	(16x256x32)	kernel_size=(5,1), stride=(2,1)
Decoder	Resid4	(16x256x32)	(16x256x32)	kernel_size=(3,3), stride=(1,1)
Decoder	Deconv7	(16x256x32)	(1x512x64)	kernel_size=(3,3), stride=(2,2)

References

[1] RAVE
[2] How Good are Variational Autoencoders at Transfer Learning?
[3] Hsu W., et al, "Learning Latent Representations for Speech Generation and Transformation"
[4] β-Annealed Variational Autoencoder for glitches
[5] Jukebox: A Generative Model for Music
[6] MusicVAE: Creating a palette for musical scores with machine learning.
[7] Medley-solos-DB dataset
[8] Vahdat A., Kautz J.

Reconstructing Resonance: VAEs for Instrument Audio Generation

Outline