Machine Learning

From NoskeWiki


This page is a child of: Artificial Intelligence

Machine Learning (ML) is a subset of AI focused specifically on the idea that machines can learn from data, identify patterns, and make decisions with minimal human intervention. ML is the method by which many AI outcomes are achieved: its algorithms learn from data and improve over time, and in general the more good-quality data they are exposed to, the better their predictions or decisions become.

This page is supposed to turn into a glossary to help me come up to speed with a job in Machine Learning / AI in 2023.

Machine Learning Glossary

Here we've focussed on the acronyms for quick lookup.


  • NLP (Natural Language Processing) is the branch of AI concerned with enabling computers to process, understand, and generate human language, in tasks such as translation, summarization, and question answering. Here's an overview of the text preprocessing steps in NLP in their typical order:
    1. Tokenization: This first step involves breaking down text into individual units called tokens, which can be words, numbers, or punctuation marks. It helps in structuring the text for further analysis.
    2. Segmentation: In languages where words are not separated by spaces, or in cases of speech-to-text where sentences need to be delineated, segmentation is crucial. It involves dividing a large continuous piece of text into distinct units, like sentences or words.
    3. Cleaning and Normalization: This step involves removing unnecessary characters, like HTML tags, special characters, and punctuation, and standardizing the text (e.g., converting to lowercase) to ensure uniformity.
    4. Stop Words Removal: Stop words (common words like 'the', 'is', etc.) are usually removed as they frequently appear in the language and typically don't contribute much to the meaning of a sentence for many NLP tasks.
    5. Stemming: This process reduces words to their word stem or root form by stripping affixes, so the result is sometimes not a real word. It's a way to treat different forms of the same word as equivalent, though it's a cruder approach than lemmatization. For instance, the stem of the words "waiting", "waited", and "waits" is "wait".
    6. Lemmatization: This step transforms words into their base form (the lemma) and ensures that the result is a real word in the language. It's more accurate than stemming as it considers context and uses a vocabulary and morphological analysis of words.
    7. Part-of-Speech Tagging: Each word is tagged with a part of speech (like noun, verb, adjective, etc.) based on its context, which helps in understanding the grammatical structure of sentences.
    8. Named Entity Recognition (NER): This process involves identifying and classifying named entities (like names of people, places, and organizations) present in the text into predefined categories.
    9. Parsing/Syntactic Analysis: This involves analyzing the grammatical structure of a sentence and establishing relationships between “head” words and words that modify those heads.
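
The first few preprocessing steps above can be sketched in plain Python. This is only a rough illustration, not how a production pipeline works: the stop-word list, the suffix list, and all function names are assumptions made for the example, and the tokenizer folds punctuation removal into tokenization. Steps that need language resources (lemmatization, part-of-speech tagging, NER, parsing) are left out; in practice a library such as NLTK or spaCy would normally handle them.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "an", "and", "for", "of", "to", "in"}

# Suffixes stripped by the crude stemmer below, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Split text into word and number tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(tokens):
    """Lowercase every token so that 'The' and 'the' are treated alike."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop common words that carry little meaning for many tasks."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, normalize, remove stop words, then stem."""
    tokens = remove_stop_words(normalize(tokenize(text)))
    return [crude_stem(t) for t in tokens]


print(preprocess("The agent is waiting, waited and waits for the reward."))
# → ['agent', 'wait', 'wait', 'wait', 'reward']
```

A real stemmer (for example the Porter stemmer) applies many more rules than this three-suffix version, which is why its output can include strings that are not dictionary words.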

  • RL (Reinforcement Learning) is a type of ML where an agent learns to make decisions by performing actions in an environment so as to maximize its cumulative reward. The fundamental components are:
    (1) Agent: The learner or decision maker.
    (2) Environment: Everything the agent interacts with.
    (3) Actions: What the agent can do, chosen according to the RL policy.
    (4) State: The current situation returned by the environment.
    (5) Reward: Feedback from the environment in response to an action.
    Here are some RL algorithms and concepts:
    • PPO (Proximal Policy Optimization): Streamlines policy gradient methods by using a clipped objective to ensure small, stable updates during training. OpenAI has described it as its default RL algorithm, and it is said to strike a balance between performance and ease of understanding: simplicity, stability, and sample efficiency.
    • A2C (Advantage Actor-Critic): Combines policy gradient (Actor) and value function (Critic), using the advantage function for more efficient updates.
    • TRPO (Trust Region Policy Optimization): Optimizes policies within a defined 'trust region' to make significant yet constrained updates, ideal for large, nonlinear policies.
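    • NLPO (Natural Language Policy Optimization): A PPO variant introduced alongside the RL4LMs library for applying RL to language generation. It copes with the very large action space (the token vocabulary) by masking out unlikely tokens when sampling.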
    • Policy: In the context of RL, a "policy" is a fundamental concept that defines the behaviour of an agent. Specifically, a policy is a strategy or a rule that the agent follows to decide what action to take in a given state of the environment. The policy maps states of the environment to actions the agent should take when in those states. It can be deterministic, where a specific state always results in the same action, or stochastic, where actions are chosen based on a probability distribution. The goal in many RL scenarios is to learn an optimal policy that maximizes the cumulative reward the agent receives over time.
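
The components listed above (agent, environment, actions, state, reward, policy) fit together in a simple interaction loop. The sketch below is a toy illustration under stated assumptions: `CorridorEnv`, its reward values, and both policies are hypothetical names invented for this example, and no learning takes place; it only shows how a policy maps states to actions and how reward accumulates over an episode.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the goal cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        """Put the agent back at the start and return the initial state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (state, reward, done)."""
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1  # small cost per step, bonus at the goal
        return self.state, reward, done


def always_right(state, rng):
    """Deterministic policy: the same state always yields the same action."""
    return 1


def stochastic_policy(state, rng, p_right=0.8):
    """Stochastic policy: the action is sampled from a probability distribution."""
    return 1 if rng.random() < p_right else -1


def run_episode(env, policy, rng, max_steps=100):
    """Run the agent-environment loop and return the cumulative reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, rng)             # agent picks an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


rng = random.Random(0)
print(run_episode(CorridorEnv(), always_right, rng))
print(run_episode(CorridorEnv(), stochastic_policy, rng))
```

An RL algorithm such as PPO or A2C would sit on top of this loop, using the collected states, actions, and rewards to adjust the policy toward higher cumulative reward.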

  • RLHF (Reinforcement Learning from Human Feedback) is a machine learning approach where a model is trained to perform tasks by incorporating feedback from humans, helping it to better understand and align with human values and preferences. Steps might be (1) pretraining a language model (LM), (2) gathering data and training a reward model, and (3) fine-tuning the LM with reinforcement learning.
    • RLHF Costs: When deploying a system using RLHF, gathering the human preference data is expensive because it depends on human workers outside the training loop. RLHF performance is only as good as the quality of its human annotations, which come in two varieties: human-generated text (such as that used to fine-tune the initial LM in InstructGPT) and labels of human preferences between model outputs. Collecting these is time-consuming, but the greater monetary cost for a system like ChatGPT is probably computational: models with billions of parameters and vast datasets require high-powered GPUs or TPUs for weeks or months per training iteration.

  • RM (Reward Model). Reward Modeling in RLHF typically involves the following steps: (1) Collect Feedback: Gather human feedback on the agent's actions. This feedback tells the agent what humans prefer or consider better in different situations. (2) Train Reward Model: Use the human feedback to train a model. This model learns to understand and quantify what actions are good or bad based on the feedback. (3) Train the Agent: The agent is then trained using this reward model. It learns to perform actions that the model predicts as good, aligning its behaviour with what humans consider desirable.
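
Step (2) above, training the reward model, can be illustrated with a pairwise preference loss. This is a minimal sketch under several assumptions that are not in the text above: the reward model is a plain linear function, each response is represented by a hand-made two-number feature vector, and the loss is the commonly used -log(sigmoid(r_chosen - r_rejected)) form. Real reward models are neural networks scoring raw text.

```python
import math


def reward(weights, features):
    """Linear reward model: a scalar score for one response's feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_loss(weights, chosen, rejected):
    """-log(sigmoid(r_chosen - r_rejected)): low when chosen scores higher."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))


def train(pairs, n_features, lr=0.1, epochs=200):
    """Fit the weights by plain gradient descent on the pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # d(loss)/d(margin) = -sigmoid(-margin) = -1 / (1 + exp(margin))
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights


# Hypothetical human feedback: (preferred response, rejected response),
# each described by made-up features [helpfulness, rudeness].
PAIRS = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.2, 0.9]),
]

weights = train(PAIRS, n_features=2)
for chosen, rejected in PAIRS:
    print(reward(weights, chosen) > reward(weights, rejected))
```

Once trained, the model's scalar output is what the agent is optimized against in step (3).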


  • ELO (Elo rating system) is a method for calculating the relative skill levels of players in zero-sum games such as chess (a zero-sum game is one in which one participant's gain or loss is exactly balanced by the losses or gains of the other participants). Developed by Arpad Elo, this system updates a player's rating based on their game results, taking into account the strength of their opponents. A win against a higher-rated player results in a greater increase in rating than a win against a lower-rated player. Conversely, losing to a lower-rated player results in a larger rating decrease. Elo ratings provide a standardized way to compare the skill levels of players in competitive gaming. An Elo system can also be used to rank text generated by competing language models from pairwise comparisons; the ratings are then normalized into a scalar reward signal for training.
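
The Elo update described above can be sketched as follows. The logistic expected-score formula with a 400-point scale is the standard one; the K-factor of 32 is just a commonly used value and is an assumption here, as are the function names.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, score_a, k=32):
    """Return both new ratings; score_a is 1 for a win, 0.5 draw, 0 loss."""
    expected_a = expected_score(rating_a, rating_b)
    change = k * (score_a - expected_a)
    # Zero-sum: whatever A gains, B loses.
    return rating_a + change, rating_b - change


print(update(1500, 1500, 1))  # → (1516.0, 1484.0)

# Beating a stronger opponent earns more than beating a weaker one.
upset_gain = update(1400, 1600, 1)[0] - 1400
routine_gain = update(1600, 1400, 1)[0] - 1600
print(upset_gain > routine_gain)  # → True
```

The same update works when the "players" are two language models and the "game result" is a human judgment of which output was better.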
  • KL (Kullback–Leibler) divergence, also called relative entropy and I-divergence, is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q. The KL divergence is a key concept in many fields such as machine learning, statistics, and data science. Some key points are:
    1. Not a True Distance: Unlike Euclidean distance or other metrics, KL divergence is not symmetric. This means KL(P||Q) is not the same as KL(Q||P). It's a measure of how one probability distribution diverges from a second, but not the other way around.
    2. Non-Negativity: The KL divergence is always non-negative, KL(P||Q)≥0, and is zero if and only if P and Q are the same distribution in all aspects.
    3. Formula: For discrete probability distributions P and Q, the KL divergence is defined as: $\mathrm{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$, where x ranges over all possible events. For continuous distributions, the sum is replaced by an integral.
    4. Interpretation: KL divergence can be thought of as a measure of the information lost when Q is used to approximate P. In other words, it tells us how well a probabilistic model Q approximates the true distribution P.
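
The properties above can be checked numerically. The sketch below assumes the standard discrete definition, KL(P||Q) = sum over x of P(x) log(P(x)/Q(x)), uses the natural logarithm (so the result is in nats), and takes each distribution as a list of probabilities over the same events; the function name is made up for the example.

```python
import math


def kl_divergence(p, q):
    """KL(P||Q) in nats for two discrete distributions over the same events."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue          # a term with P(x) = 0 contributes nothing
        if qx == 0.0:
            return math.inf   # P puts mass where Q has none
        total += px * math.log(px / qx)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q))  # differs from the line below: not symmetric
print(kl_divergence(q, p))
print(kl_divergence(p, p))  # → 0.0
```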
  • LoRA (Low-Rank Adaptation) is a technique for efficiently adapting a pre-trained neural network to new tasks or datasets. Rather than updating every weight, LoRA freezes the original weight matrix and trains a pair of small low-rank matrices whose product is added to it, so only a small fraction of the parameters need to be trained and stored.
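
The low-rank idea can be illustrated with small matrices in plain Python. This is a sketch under assumptions: the names `lora_merge` and `trainable_params` are invented for the example, the update is written as W + (alpha / r) * B A following the usual LoRA convention, and the numbers are arbitrary. It is not the API of any particular library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_merge(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); W itself is left unchanged."""
    scale = alpha / rank
    delta = matmul(b, a)  # same shape as W, but has rank at most `rank`
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_params(d_out, d_in, rank):
    """Parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]      # frozen pre-trained weights (2 x 3)
b = [[1.0], [2.0]]         # trainable, 2 x 1
a = [[1.0, 0.0, -1.0]]     # trainable, 1 x 3

print(lora_merge(w, b, a, alpha=2, rank=1))
# → [[3.0, 0.0, -2.0], [4.0, 1.0, -4.0]]

# For a 4096 x 4096 layer, rank 8 trains far fewer parameters than the full matrix.
print(trainable_params(4096, 4096, 8), 4096 * 4096)  # → 65536 16777216
```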


  • BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another. The central idea is that "the closer a machine translation is to a professional human translation, the better it is". Scores range from 0 to 1; a score of 1 is rarely achieved in practice, since it would require the output to match a reference translation exactly.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.
  • PPO (Proximal Policy Optimization) is a popular Reinforcement Learning algorithm known for its effectiveness and simplicity. It improves a policy in small steps, limiting how far the policy can change in each update so that learning stays stable and efficient while still balancing exploration (trying new actions) and exploitation (using known good actions). OpenAI has described it as its default RL algorithm. See also the PPO entry under RL above.
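
The limit on policy updates mentioned in the PPO entry is usually implemented as a clipped surrogate objective. The sketch below computes that objective for a single action; the function name and the clip range of 0.2 are assumptions for illustration (0.2 is a commonly used default), and a real implementation would average this over a batch and combine it with value and entropy terms.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, eps=0.2):
    """PPO clipped objective for one action (to be maximized)."""
    # Probability ratio between the updated policy and the old policy.
    ratio = math.exp(new_logp - old_logp)
    # Keep the ratio inside [1 - eps, 1 + eps].
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Taking the minimum removes any incentive to move the ratio
    # outside the clip range in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy makes a good action 1.5x more likely: the gain is capped at 1.2.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))

# The new policy makes a bad action half as likely: the objective is capped at -0.8.
print(clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0))
```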


  • LLaMA is a family of foundation language models developed by Meta AI (formerly known as Facebook AI). It represents Meta's significant contribution to the field of large-scale language models, similar to OpenAI's GPT series. LLaMA is designed to serve as a robust, versatile backbone for a variety of natural language processing tasks. Key characteristics of LLaMA include: Scalability and Efficiency (high performance and efficiency, capable of being scaled to handle various sizes of datasets and computational constraints), General-Purpose Design (simple text generation to more complex tasks like translation, summarization, and question-answering), Research and Collaboration Focus (Meta AI often emphasizes the research potential of its models and the benefits of collaboration in the AI community), Ethical and Safe AI Development (focus on ensuring ethical usage and mitigating biases). As of Nov 2023, Meta's main chatbot offering is the Meta AI assistant, announced in September 2023 in beta and built on a Llama 2 based model.
  • ChatGPT is an AI chatbot developed by OpenAI. It's based on the GPT (Generative Pre-trained Transformer) architecture, which is designed to generate human-like text based on the input it receives. ChatGPT has been trained on a diverse range of internet text, enabling it to respond to various prompts, answer questions, write essays, compose emails, and even create creative content like stories and poems. The underlying GPT series has gone through many iterations: GPT (released June 2018), GPT-2 (announced February 2019, fully released November 2019), GPT-3 (launched June 2020), InstructGPT (January 2022), and GPT-4 (launched March 2023). ChatGPT itself launched in November 2022, initially on a GPT-3.5 model. Each version of GPT has been a step forward in natural language processing, showcasing improvements in understanding context, generating more coherent and contextually relevant text, and handling a wider range of language tasks.
  • Claude is an AI chatbot by Anthropic, a company founded by former OpenAI researchers who worked on GPT-2 and GPT-3. Similar to ChatGPT, Claude uses a messaging interface where users can submit questions or requests and receive detailed and relevant responses. Initially available in closed beta through a Slack integration, Claude is now accessible via a website. Its models are periodically retrained on more recent data, and it can read up to about 75,000 words at a time, which means it can read a short book and answer questions about it.

Code Repos

  • rl4lms (Reinforcement Learning for Language Models) is a library by the Allen Institute for AI.
  • trl (Transformer Reinforcement Learning) is a full-stack Hugging Face library that provides a set of tools to train transformer language models and stable diffusion models with Reinforcement Learning, from the Supervised Fine-tuning step (SFT) and the Reward Modeling step (RM) to the Proximal Policy Optimization (PPO) step. The library is built on top of the transformers library by 🤗 Hugging Face.
  • trlX (Transformer Reinforcement Learning X) is a distributed training framework, built by CarperAI, designed from the ground up to focus on fine-tuning large language models with reinforcement learning using either a provided reward function or a reward-labeled dataset. Training support for 🤗 Hugging Face models is provided by Accelerate-backed trainers, allowing users to fine-tune causal and T5-based language models of up to 20B parameters, such as facebook/opt-6.7b, EleutherAI/gpt-neox-20b, and google/flan-t5-xxl.
  • peft (Parameter-Efficient Fine-Tuning) is a library of methods (such as LoRA, Prefix Tuning, P-Tuning etc.) that enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. Fine-tuning large-scale PLMs is often prohibitively costly, so PEFT methods fine-tune only a small number of (extra) model parameters, thereby greatly decreasing the computational and storage costs. A common strategy is to apply Low-Rank Adaptation (LoRA) to a model loaded in 8-bit.


See Also