
Continued Pre-training

Continued pre-training is a technique for extending a base language model's knowledge by training it further on a new corpus of data. CosmicAC provides this as a managed job type, handling GPU provisioning and checkpoint storage so you can focus on the training data and configuration.


What Continued Pre-training Is

A base language model is trained on a large, general-purpose corpus using a next-token prediction objective. This process gives the model broad linguistic capability and general world knowledge, but it does not make the model an expert in any specific domain.

Continued pre-training (CPT) picks up where pre-training left off. You train the model further, using the same next-token prediction objective, on a domain-specific corpus. The model updates its weights to better represent the statistical patterns in your data. The result is a model with the same general capability as the base model, but with a shifted knowledge distribution that reflects your domain.

Because CPT uses the same self-supervised objective as the original pre-training (predicting the next token), it does not require labeled examples. Your training data is raw text: technical documentation, scientific literature, codebases, internal knowledge bases, or any other domain corpus.
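To see why no labels are needed, it helps to look at how training pairs fall out of raw text. The sketch below, in plain Python with whitespace tokenization standing in for a real subword tokenizer, shows that the "label" for each position is simply the token that follows it:

```python
# Sketch: next-token prediction derives its targets from the text itself.
# Whitespace splitting is a stand-in for a real subword tokenizer.

def next_token_pairs(text):
    """Yield (context, next_token) training pairs from raw text."""
    tokens = text.split()
    for i in range(1, len(tokens)):
        yield tokens[:i], tokens[i]

corpus = "the reactor coolant loop operates at high pressure"
pairs = list(next_token_pairs(corpus))
# First pair: context ["the"], target "reactor".
# Every position in the corpus becomes a training example for free.
```

This is why any domain corpus, from codebases to internal wikis, can serve as CPT training data without an annotation step.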


How Continued Pre-training Differs from Fine-tuning

The terms continued pre-training and fine-tuning are sometimes used interchangeably, but they describe different things.

Continued pre-training trains on unlabeled text to shift the model's knowledge. You are not teaching the model a new behavior or task; you are extending what it knows. The training objective is the same as in the original pre-training: predict the next token across a large corpus.

Fine-tuning (also called supervised fine-tuning, or SFT) trains on labeled examples, typically instruction-response pairs or conversation transcripts, to shape how the model behaves. The objective is to make the model produce specific types of outputs given specific types of inputs.

In practice, these two techniques are often used in sequence:

  1. Start with a general-purpose base model.
  2. Apply continued pre-training on your domain corpus to ground the model in domain knowledge.
  3. Apply fine-tuning on labeled examples to teach it how to respond in the context of that domain.

CPT is the knowledge step. Fine-tuning is the behavior step.
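The difference between the two steps is visible in the shape of the training data. A hypothetical sketch (the field names are illustrative, not CosmicAC's actual schema):

```python
import json

# Continued pre-training: raw, unlabeled domain text.
cpt_record = {"text": "Pump seals must be inspected after 500 operating hours ..."}

# Supervised fine-tuning: labeled instruction-response pairs.
sft_record = {
    "prompt": "How often should pump seals be inspected?",
    "response": "Pump seals should be inspected after every 500 operating hours.",
}

# Both are commonly stored one JSON object per line (JSONL).
cpt_line = json.dumps(cpt_record)
sft_line = json.dumps(sft_record)
```

The CPT record carries only text; the SFT record pairs an input with the desired output, which is what makes it "labeled."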


When to Use Continued Pre-training

CPT is appropriate when:

  • The base model lacks meaningful knowledge of your domain because the domain is specialized, uses domain-specific vocabulary, or was underrepresented in the original pre-training corpus.
  • You have a substantial corpus of domain text (typically hundreds of millions to billions of tokens for meaningful knowledge transfer, though smaller datasets can still help on narrow domains).
  • You want to preserve the model's general capability while adding domain knowledge, rather than replacing its behavior entirely.

CPT is less appropriate when:

  • The base model already understands your domain reasonably well and you primarily want to change how it responds. In that case, fine-tuning alone is more efficient.
  • Your dataset is small. CPT on a small corpus can degrade general capability without meaningfully improving domain performance.
  • Your goal is a narrow task (classification, extraction, summarization to a specific format). Supervised fine-tuning on task-specific examples will outperform CPT for task performance.
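A quick way to sanity-check whether a corpus clears the size bar is to estimate its token count from its size on disk. The ~4 characters per token ratio below is a common rule of thumb for English text, not a property of any particular tokenizer; your real ratio may differ:

```python
# Rough corpus-size check before committing to a CPT run.
# chars_per_token=4 is a rule-of-thumb assumption for English text.

def estimate_tokens(total_chars, chars_per_token=4):
    return total_chars // chars_per_token

# e.g. a 2 GB plain-text corpus:
corpus_bytes = 2 * 1024**3
approx_tokens = estimate_tokens(corpus_bytes)  # ~537 million tokens
```

By this estimate, a 2 GB corpus lands in the "hundreds of millions of tokens" range where CPT can meaningfully shift the model's knowledge.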

Checkpoints

CosmicAC saves checkpoints at intervals during training. A checkpoint is a snapshot of the model's weights at a specific point in the training run. Checkpoints let you:

  • Resume a training run if it is interrupted.
  • Evaluate model quality at different stages of training and select the best checkpoint rather than using the final weights.
  • Roll back to an earlier state if later training steps degrade performance.

Checkpoint frequency is configured when you set up the job.
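Selecting the best checkpoint rather than the final weights typically means comparing held-out evaluation loss across checkpoints. A minimal sketch, with illustrative (step, eval_loss) records rather than real CosmicAC output:

```python
# Pick the checkpoint with the lowest held-out evaluation loss.
checkpoints = [
    {"step": 1000, "eval_loss": 2.41},
    {"step": 2000, "eval_loss": 2.18},
    {"step": 3000, "eval_loss": 2.22},  # loss rose again: later steps degraded quality
]

best = min(checkpoints, key=lambda c: c["eval_loss"])
# best["step"] == 2000: roll back to the step-2000 snapshot
```

In this example the final checkpoint is not the best one, which is exactly the situation the roll-back capability exists for.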
