Preliminary Investigation Checklist
Item to verify 1: The authenticity of Jack Rae's educational background and career history, including his name and his presence in the AI field.
Item to verify 2: Career timeline: DeepMind (c. 2015–2023), OpenAI (2023–2024), Google DeepMind (2024–2025), Meta (from June 2025).
Item to verify 3: Core contribution at DeepMind: led the Gopher project, which "for the first time halved the gap to human exam-level performance".
Item to verify 4: Core contribution at DeepMind: proposed the "Chinchilla Scaling Laws", and that work was "the NeurIPS 2022 Best Paper".
Item to verify 5: Core contribution at DeepMind: served as pre-training lead for Gemini 1.0/2.0.
Item to verify 6: Core contribution at OpenAI: "one of the core developers of GPT-4".
Item to verify 7: Core contribution at OpenAI: proposed the "compression is intelligence" theory.
Item to verify 8: Position at Meta: personally recruited by Zuckerberg, becoming a core member of "Meta Superintelligence Labs".
Item to verify 9: Summary statement: "Jack Rae is one of the very few technical leaders to have worked across all three leading AI organizations: DeepMind, OpenAI, and Meta".
Action Plan
Let me first search for basic information on Jack Rae to confirm his presence in the AI field.
Search
Jack Rae AI researcher DeepMind Google Scholar
I need to continue searching for more information about Jack Rae, particularly his educational background and career history.
Search
Jack Rae AI researcher education background Bristol UCL CMU
I need to continue searching for Jack Rae's academic background and career history, particularly his work on Chinchilla and Gopher.
Search
Jack Rae Chinchilla Gopher NeurIPS best paper 2022
I need to continue searching for Jack Rae's time at OpenAI and the "compression is intelligence" theory he is said to have proposed.
Search
Jack Rae OpenAI compression is intelligence theory
Verified Conclusions
Verified item 1: Jack Rae's career can be divided into the following stages: ...DeepMind (c. 2015–2023)...OpenAI (2023–2024)...Google DeepMind (2024–2025)...Meta (from June 2025)
Verified item 2: PhD: University College London (UCL), UK, in computer science
Verified item 3: Led projects: ...Chinchilla: proposed the "Chinchilla Scaling Laws"
Verified item 4: Chinchilla..."the NeurIPS 2022 Best Paper"
Verified item 5: Gemini series: pre-training lead for Gemini 1.0/2.0, reasoning technical lead for Gemini 2.5
Verified item 6: Background: personally recruited by Zuckerberg, one of the first key hires for Meta's AGI team
Remaining Items to Verify
Item to verify 1: One of the core developers of GPT-4, involved in pre-training and inference system design
Item to verify 2: Led project: Gopher: a 280-billion-parameter language model that "for the first time halved the gap to human exam-level performance"
Item to verify 3: Proposed the "compression is intelligence" theory: that large-model training is essentially lossless compression of data, and the more efficient the compression, the more intelligent the model
Item to verify 4: Undergraduate: University of Bristol, UK; Master's: Carnegie Mellon University (CMU), USA
Item to verify 5: Position: core member of Meta Superintelligence Labs
Action Plan
Let me first search for Jack Rae's role in the GPT-4 project.
Search
Jack Rae GPT-4 OpenAI technical report contributors
I need to continue searching for Jack Rae's role and contributions in the Gopher project.
Search
Jack Rae Gopher language model DeepMind paper
I need to find more about Jack Rae's role in the Gopher project and the model's specific achievements.
Search
Gopher language model human reading comprehension gap halved
I need to find more about the "compression is intelligence" theory from Jack Rae's time at OpenAI.
Search
Jack Rae OpenAI compression is intelligence theory Stanford MLSys Seminar
Verified Conclusions
Verified item 1: Jack Rae's career can be divided into the following stages: ...DeepMind (c. 2015–2023)...OpenAI (2023–2024)...Google DeepMind (2024–2025)...Meta (from June 2025)
Verified item 2: PhD: University College London (UCL), UK, in computer science
Verified item 3: Chinchilla: proposed the "Chinchilla Scaling Laws"..."the NeurIPS 2022 Best Paper"
Verified item 4: Gemini series: pre-training lead for Gemini 1.0/2.0, reasoning technical lead for Gemini 2.5
Verified item 5: Background: personally recruited by Zuckerberg, one of the first key hires for Meta's AGI team
Remaining Items to Verify
Item to verify 1: One of the core developers of GPT-4, involved in pre-training and inference system design. Sources to check: the GPT-4 Technical Report (https://arxiv.org/abs/2303.08774) and the official contributors page (https://openai.com/contributions/gpt-4).
Item to verify 2: Led project: Gopher: a 280-billion-parameter language model that "for the first time halved the gap to human exam-level performance"
Item to verify 3: Proposed the "compression is intelligence" theory: that large-model training is essentially lossless compression of data, and the more efficient the compression, the more intelligent the model
Item to verify 4: Undergraduate: University of Bristol, UK; Master's: Carnegie Mellon University (CMU), USA
Item to verify 5: Position: core member of Meta Superintelligence Labs
Action Plan
Let me first examine the GPT-4 Technical Report for Jack Rae's contributions.
GPT-4 Technical Report
OpenAI∗
Abstract
We report the development of GPT-4, a large-scale, multimodal model which can
accept image and text inputs and produce text outputs. While less capable than
humans in many real-world scenarios, GPT-4 exhibits human-level performance
on various professional and academic benchmarks, including passing a simulated
bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-
based model pre-trained to predict the next token in a document. The post-training
alignment process results in improved performance on measures of factuality and
adherence to desired behavior. A core component of this project was developing
infrastructure and optimization methods that behave predictably across a wide
range of scales. This allowed us to accurately predict some aspects of GPT-4’s
performance based on models trained with no more than 1/1,000th the compute of
GPT-4.
1 Introduction
This technical report presents GPT-4, a large multimodal model capable of processing image and
text inputs and producing text outputs. Such models are an important area of study as they have the
potential to be used in a wide range of applications, such as dialogue systems, text summarization,
and machine translation. As such, they have been the subject of substantial interest and progress in
recent years [1–34].
One of the main goals of developing such models is to improve their ability to understand and generate
natural language text, particularly in more complex and nuanced scenarios. To test its capabilities
in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In
these evaluations it performs quite well and often outscores the vast majority of human test takers.
For example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers.
This contrasts with GPT-3.5, which scores in the bottom 10%.
On a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models
and most state-of-the-art systems (which often have benchmark-specific training or hand-engineering).
On the MMLU benchmark [35, 36], an English-language suite of multiple-choice questions covering
57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but
also demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4
surpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these
model capability results, as well as model safety improvements and results, in more detail in later
sections.
This report also discusses a key challenge of the project, developing deep learning infrastructure and
optimization methods that behave predictably across a wide range of scales. This allowed us to make
predictions about the expected performance of GPT-4 (based on small runs trained in similar ways)
that were tested against the final run to increase confidence in our training.
Despite its capabilities, GPT-4 has similar limitations to earlier GPT models [1, 37, 38]: it is not fully
reliable (e.g. can suffer from “hallucinations”), has a limited context window, and does not learn
∗Please cite this work as "OpenAI (2023)". Full authorship contribution statements appear at the end of the document. Correspondence regarding this technical report can be sent to gpt4-report@openai.com
from experience. Care should be taken when using the outputs of GPT-4, particularly in contexts
where reliability is important.
GPT-4’s capabilities and limitations create significant and novel safety challenges, and we believe
careful study of these challenges is an important area of research given the potential societal impact.
This report includes an extensive system card (after the Appendix) describing some of the risks we
foresee around bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more.
It also describes interventions we made to mitigate potential harms from the deployment of GPT-4,
including adversarial testing with domain experts, and a model-assisted safety pipeline.
2 Scope and Limitations of this Technical Report
This report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a
Transformer-style model [39] pre-trained to predict the next token in a document, using both publicly
available data (such as internet data) and data licensed from third-party providers. The model was
then fine-tuned using Reinforcement Learning from Human Feedback (RLHF) [40]. Given both
the competitive landscape and the safety implications of large-scale models like GPT-4, this report
contains no further details about the architecture (including model size), hardware, training compute,
dataset construction, training method, or similar.
We are committed to independent auditing of our technologies, and shared some initial steps and
ideas in this area in the system card accompanying this release.2 We plan to make further technical
details available to additional third parties who can advise us on how to weigh the competitive and
safety considerations above against the scientific value of further transparency.
3 Predictable Scaling
A large focus of the GPT-4 project was building a deep learning stack that scales predictably. The
primary reason is that for very large training runs like GPT-4, it is not feasible to do extensive
model-specific tuning. To address this, we developed infrastructure and optimization methods that
have very predictable behavior across multiple scales. These improvements allowed us to reliably
predict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–10,000× less compute.
3.1 Loss Prediction
The final loss of properly-trained large language models is thought to be well approximated by power
laws in the amount of compute used to train the model [41, 42, 2, 14, 15].
To verify the scalability of our optimization infrastructure, we predicted GPT-4’s final loss on our
internal codebase (not part of the training set) by fitting a scaling law with an irreducible loss term
(as in Henighan et al. [15]): L(C) = aC^b + c, from models trained using the same methodology
but using at most 10,000x less compute than GPT-4. This prediction was made shortly after the run
started, without use of any partial results. The fitted scaling law predicted GPT-4’s final loss with
high accuracy (Figure 1).
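To make this concrete, the following is a minimal illustrative sketch (not OpenAI's actual code) of fitting a scaling law with an irreducible loss term, L(C) = aC^b + c, to small-run results and extrapolating to the full-compute run. The compute values, loss numbers, and the use of scipy's curve_fit are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(c, a, b, irreducible):
    # L(C) = a * C**b + irreducible, with b < 0 so loss falls as compute grows.
    return a * np.power(c, b) + irreducible

# Hypothetical small-run results; compute is normalized so the target run is 1.0.
compute = np.array([1e-7, 1e-6, 1e-5, 1e-4, 1e-3])
true_a, true_b, true_c = 1.2, -0.08, 1.1  # made-up "ground truth" used to generate the data
rng = np.random.default_rng(0)
loss = scaling_law(compute, true_a, true_b, true_c) + rng.normal(0.0, 0.01, compute.size)

# Fit the three parameters using only the small runs.
(a, b, c), _ = curve_fit(scaling_law, compute, loss, p0=(1.0, -0.1, 1.0), maxfev=20_000)

# Extrapolate to the full-compute run (normalized compute = 1.0).
print(f"predicted final loss at C = 1: {scaling_law(1.0, a, b, c):.3f}")
```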
3.2 Scaling of Capabilities on HumanEval
Having a sense of the capabilities of a model before training can improve decisions around alignment,
safety, and deployment. In addition to predicting final loss, we developed methodology to predict
more interpretable metrics of capability. One such metric is pass rate on the HumanEval dataset [43],
which measures the ability to synthesize Python functions of varying complexity. We successfully
predicted the pass rate on a subset of the HumanEval dataset by extrapolating from models trained
with at most 1,000× less compute (Figure 2).
For an individual problem in HumanEval, performance may occasionally worsen with scale. Despite
these challenges, we find an approximate power law relationship −E_P[log(pass_rate(C))] = α·C^(−k)
2 In addition to the accompanying system card, OpenAI will soon publish additional thoughts on the social and economic implications of AI systems, including the need for effective regulation.
[Figure 1 chart: bits per word on OpenAI codebase next-word prediction vs. training compute (normalized so GPT-4 = 1), showing observed points, the power-law prediction, and gpt-4.]
Figure 1. Performance of GPT-4 and smaller models. The metric is final loss on a dataset derived
from our internal codebase. This is a convenient, large dataset of code tokens which is not contained in
the training set. We chose to look at loss because it tends to be less noisy than other measures across
different amounts of training compute. A power law fit to the smaller models (excluding GPT-4) is
shown as the dotted line; this fit accurately predicts GPT-4’s final loss. The x-axis is training compute
normalized so that GPT-4 is 1.
[Figure 2 chart: mean log pass rate on 23 coding problems vs. training compute (normalized so GPT-4 = 1), showing observed points, the power-law prediction, and gpt-4.]
Figure 2. Performance of GPT-4 and smaller models. The metric is mean log pass rate on a subset of
the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted
line; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that
GPT-4 is 1.
where k and α are positive constants, and P is a subset of problems in the dataset. We hypothesize
that this relationship holds for all problems in this dataset. In practice, very low pass rates are difficult
or impossible to estimate, so we restrict to problems P and models M such that given some large
sample budget, every problem is solved at least once by every model.
We registered predictions for GPT-4’s performance on HumanEval before training completed, using
only information available prior to training. All but the 15 hardest HumanEval problems were split
into 6 difficulty buckets based on the performance of smaller models. The results on the 3rd easiest
bucket are shown in Figure 2, showing that the resulting predictions were very accurate for this
subset of HumanEval problems where we can accurately estimate log(pass_rate) for several smaller
models. Predictions on the other five buckets performed almost as well, the main exception being
GPT-4 underperforming our predictions on the easiest bucket.
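As a rough illustration of the registered-prediction idea, the sketch below fits the stated power law −E_P[log(pass_rate(C))] = α·C^(−k) in log-log space to hypothetical small-model measurements and extrapolates to the full run. The data points are invented and this is not the report's actual methodology, only a sketch of the stated relationship.

```python
import numpy as np

# Hypothetical mean negative log pass rates, -E_P[log pass_rate(C)], measured on a
# fixed bucket of HumanEval problems for small runs (compute normalized so the
# target run is 1.0).
compute = np.array([1e-6, 1e-5, 1e-4, 1e-3])
mean_neg_log_pass = np.array([4.8, 3.1, 2.0, 1.3])

# -E_P[log pass_rate(C)] = alpha * C**(-k) becomes linear after taking logs:
# log(y) = log(alpha) - k * log(C)
slope, intercept = np.polyfit(np.log(compute), np.log(mean_neg_log_pass), 1)
k, alpha = -slope, float(np.exp(intercept))

# Registered prediction for the full run (C = 1), made before training completes.
predicted = alpha * 1.0 ** (-k)
print(f"k = {k:.3f}, alpha = {alpha:.3f}, predicted mean -log pass rate at C = 1: {predicted:.3f}")
```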
Certain capabilities remain hard to predict. For example, the Inverse Scaling Prize [44] proposed
several tasks for which model performance decreases as a function of scale. Similarly to a recent
result by Wei et al. [45], we find that GPT-4 reverses this trend, as shown on one of the tasks called
Hindsight Neglect [46] in Figure 3.
[Figure 3 chart: accuracy on the Inverse Scaling Prize "hindsight neglect" task for ada, babbage, curie, gpt-3.5, and gpt-4.]
Figure 3. Performance of GPT-4 and smaller models on the Hindsight Neglect task. Accuracy is shown
on the y-axis, higher is better. ada, babbage, and curie refer to models available via the OpenAI API [47].
We believe that accurately predicting future capabilities is important for safety. Going forward we
plan to refine these methods and register performance predictions across various capabilities before
large model training begins, and we hope this becomes a common goal in the field.
4 Capabilities
We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally
designed for humans.4 We did no specific training for these exams. A minority of the problems in the
exams were seen by the model during training; for each exam we run a variant with these questions
removed and report the lower score of the two. We believe the results to be representative. For further
details on contamination (methodology and per-exam statistics), see Appendix C.
Exams were sourced from publicly-available materials. Exam questions included both multiple-
choice and free-response questions; we designed separate prompts for each format, and images were
included in the input for questions which required it. The evaluation setup was designed based
on performance on a validation set of exams, and we report final results on held-out test exams.
Overall scores were determined by combining multiple-choice and free-response question scores
using publicly available methodologies for each exam. We estimate and report the percentile each
overall score corresponds to. See Appendix A for further details on the exam evaluation methodology.
3 For AMC 10 and AMC 12 2022 exams, the human percentiles are not yet published, so the reported numbers are extrapolated and likely have wide uncertainty. See Appendix A.5.
4 We used the post-trained RLHF model for these exams.
| Exam | GPT-4 | GPT-4 (no vision) | GPT-3.5 |
|---|---|---|---|
| Uniform Bar Exam (MBE+MEE+MPT) | 298 / 400 (~90th) | 298 / 400 (~90th) | 213 / 400 (~10th) |
| LSAT | 163 (~88th) | 161 (~83rd) | 149 (~40th) |
| SAT Evidence-Based Reading & Writing | 710 / 800 (~93rd) | 710 / 800 (~93rd) | 670 / 800 (~87th) |
| SAT Math | 700 / 800 (~89th) | 690 / 800 (~89th) | 590 / 800 (~70th) |
| Graduate Record Examination (GRE) Quantitative | 163 / 170 (~80th) | 157 / 170 (~62nd) | 147 / 170 (~25th) |
| Graduate Record Examination (GRE) Verbal | 169 / 170 (~99th) | 165 / 170 (~96th) | 154 / 170 (~63rd) |
| Graduate Record Examination (GRE) Writing | 4 / 6 (~54th) | 4 / 6 (~54th) | 4 / 6 (~54th) |
| USABO Semifinal Exam 2020 | 87 / 150 (99th - 100th) | 87 / 150 (99th - 100th) | 43 / 150 (31st - 33rd) |
| USNCO Local Section Exam 2022 | 36 / 60 | 38 / 60 | 24 / 60 |
| Medical Knowledge Self-Assessment Program | 75 % | 75 % | 53 % |
| Codeforces Rating | 392 (below 5th) | 392 (below 5th) | 260 (below 5th) |
| AP Art History | 5 (86th - 100th) | 5 (86th - 100th) | 5 (86th - 100th) |
| AP Biology | 5 (85th - 100th) | 5 (85th - 100th) | 4 (62nd - 85th) |
| AP Calculus BC | 4 (43rd - 59th) | 4 (43rd - 59th) | 1 (0th - 7th) |
| AP Chemistry | 4 (71st - 88th) | 4 (71st - 88th) | 2 (22nd - 46th) |
| AP English Language and Composition | 2 (14th - 44th) | 2 (14th - 44th) | 2 (14th - 44th) |
| AP English Literature and Composition | 2 (8th - 22nd) | 2 (8th - 22nd) | 2 (8th - 22nd) |
| AP Environmental Science | 5 (91st - 100th) | 5 (91st - 100th) | 5 (91st - 100th) |
| AP Macroeconomics | 5 (84th - 100th) | 5 (84th - 100th) | 2 (33rd - 48th) |
| AP Microeconomics | 5 (82nd - 100th) | 4 (60th - 82nd) | 4 (60th - 82nd) |
| AP Physics 2 | 4 (66th - 84th) | 4 (66th - 84th) | 3 (30th - 66th) |
| AP Psychology | 5 (83rd - 100th) | 5 (83rd - 100th) | 5 (83rd - 100th) |
| AP Statistics | 5 (85th - 100th) | 5 (85th - 100th) | 3 (40th - 63rd) |
| AP US Government | 5 (88th - 100th) | 5 (88th - 100th) | 4 (77th - 88th) |
| AP US History | 5 (89th - 100th) | 4 (74th - 89th) | 4 (74th - 89th) |
| AP World History | 4 (65th - 87th) | 4 (65th - 87th) | 4 (65th - 87th) |
| AMC 10³ | 30 / 150 (6th - 12th) | 36 / 150 (10th - 19th) | 36 / 150 (10th - 19th) |
| AMC 12³ | 60 / 150 (45th - 66th) | 48 / 150 (19th - 40th) | 30 / 150 (4th - 8th) |
| Introductory Sommelier (theory knowledge) | 92 % | 92 % | 80 % |
| Certified Sommelier (theory knowledge) | 86 % | 86 % | 58 % |
| Advanced Sommelier (theory knowledge) | 77 % | 77 % | 46 % |
| Leetcode (easy) | 31 / 41 | 31 / 41 | 12 / 41 |
| Leetcode (medium) | 21 / 80 | 21 / 80 | 8 / 80 |
| Leetcode (hard) | 3 / 45 | 3 / 45 | 0 / 45 |

Table 1. GPT performance on academic and professional exams. In each case, we simulate the conditions and scoring of the real exam. We report GPT-4's final score graded according to exam-specific rubrics, as well as the percentile of test-takers achieving GPT-4's score.
[Figure 4 chart: estimated percentile lower bound (among test takers) for each exam, ordered by GPT-3.5 performance; series: gpt-4, gpt-4 (no vision), gpt-3.5.]
Figure 4. GPT performance on academic and professional exams. In each case, we simulate the
conditions and scoring of the real exam. Exams are ordered from low to high based on GPT-3.5
performance. GPT-4 outperforms GPT-3.5 on most exams tested. To be conservative we report the
lower end of the range of percentiles, but this creates some artifacts on the AP exams which have very
wide scoring bins. For example although GPT-4 attains the highest possible score on AP Biology (5/5),
this is only shown in the plot as 85th percentile because 15 percent of test-takers achieve that score.
GPT-4 exhibits human-level performance on the majority of these professional and academic exams.
Notably, it passes a simulated version of the Uniform Bar Examination with a score in the top 10% of
test takers (Table 1, Figure 4).
The model’s capabilities on exams appear to stem primarily from the pre-training process and are not
significantly affected by RLHF. On multiple choice questions, both the base GPT-4 model and the
RLHF model perform equally well on average across the exams we tested (see Appendix B).
We also evaluated the pre-trained base GPT-4 model on traditional benchmarks designed for evaluating
language models. For each benchmark we report, we ran contamination checks for test data appearing
in the training set (see Appendix D for full details on per-benchmark contamination).5 We used
few-shot prompting [1] for all benchmarks when evaluating GPT-4.6
GPT-4 considerably outperforms existing language models, as well as previously state-of-the-art
(SOTA) systems which often have benchmark-specific crafting or additional training protocols
(Table 2).
5 During our contamination check we discovered that portions of BIG-bench [48] were inadvertently mixed into the training set, and we excluded it from our reported results.
6 For GSM-8K, we include part of the training set in GPT-4's pre-training mix (see Appendix E for details). We use chain-of-thought prompting [11] when evaluating.
| Benchmark | GPT-4 (evaluated few-shot) | GPT-3.5 (evaluated few-shot) | LM SOTA (best external LM, evaluated few-shot) | SOTA (best external model, incl. benchmark-specific tuning) |
|---|---|---|---|---|
| MMLU [49] (multiple-choice questions in 57 subjects, professional & academic) | 86.4% (5-shot) | 70.0% (5-shot) | 70.7% (5-shot U-PaLM [50]) | 75.2% (5-shot Flan-PaLM [51]) |
| HellaSwag [52] (commonsense reasoning around everyday events) | 95.3% (10-shot) | 85.5% (10-shot) | 84.2% (LLaMA, validation set [28]) | 85.6 (ALUM [53]) |
| AI2 Reasoning Challenge (ARC) [54] (grade-school multiple-choice science questions, challenge set) | 96.3% (25-shot) | 85.2% (25-shot) | 85.2% (8-shot PaLM [55]) | 86.5% (ST-MOE [18]) |
| WinoGrande [56] (commonsense reasoning around pronoun resolution) | 87.5% (5-shot) | 81.6% (5-shot) | 85.1% (5-shot PaLM [3]) | 85.1% (5-shot PaLM [3]) |
| HumanEval [43] (Python coding tasks) | 67.0% (0-shot) | 48.1% (0-shot) | 26.2% (0-shot PaLM [3]) | 65.8% (CodeT + GPT-3.5 [57]) |
| DROP [58] (F1 score; reading comprehension & arithmetic) | 80.9 (3-shot) | 64.1 (3-shot) | 70.8 (1-shot PaLM [3]) | 88.4 (QDGAT [59]) |
| GSM-8K [60] (grade-school mathematics questions) | 92.0%∗ (5-shot chain-of-thought) | 57.1% (5-shot) | 58.8% (8-shot Minerva [61]) | 87.3% (Chinchilla + SFT+ORM-RL, ORM reranking [62]) |
Table 2. Performance of GPT-4 on academic benchmarks. We compare GPT-4 alongside the best
SOTA (with benchmark-specific training) and the best SOTA for an LM evaluated few-shot. GPT-4
outperforms existing LMs on all benchmarks, and beats SOTA with benchmark-specific training on all
datasets except DROP. For each task we report GPT-4’s performance along with the few-shot method
used to evaluate. For GSM-8K, we included part of the training set in the GPT-4 pre-training mix
(see Appendix E), and we use chain-of-thought prompting [11] when evaluating. For multiple-choice
questions, we present all answers (ABCD) to the model and ask it to choose the letter of the answer,
similarly to how a human would solve such a problem.
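For the multiple-choice protocol described in the caption, a prompt along these lines would present all answer options and ask only for a letter. This is a hypothetical sketch of such a format; the exact wording OpenAI used is not given in the report.

```python
def format_multiple_choice(question: str, choices: list[str]) -> str:
    """Label the options A-D and ask the model to answer with a single letter."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer with the letter of the correct choice:")
    return "\n".join(lines)

# Example with an invented question.
print(format_multiple_choice(
    "Which planet is closest to the Sun?",
    ["Venus", "Mercury", "Earth", "Mars"],
))
```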
Many existing ML benchmarks are written in English. To gain an initial understanding of GPT-4’s
capabilities in other languages, we translated the MMLU benchmark [35, 36] – a suite of multiple-
choice problems spanning 57 subjects – into a variety of languages using ...
I need to continue looking for contributor information in the GPT-4 Technical Report; I will check the end of the report.
Search
GPT-4 technical report contributors authors Jack Rae OpenAI
I need to find more about Jack Rae's role and contributions in the Gopher project, and whether he was involved in developing GPT-4.
Search
Jack Rae Gopher DeepMind author contribution
I need to find more about Jack Rae's contribution to the "compression is intelligence" theory, and whether he was involved in developing GPT-4.
Search
Jack Rae compression is intelligence theory Stanford MLSys Seminar
Verified Conclusions
Verified item 1: Jack Rae's career can be divided into the following stages: ...DeepMind (c. 2015–2023)...OpenAI (2023–2024)...Google DeepMind (2024–2025)...Meta (from June 2025)
Verified item 2: PhD: University College London (UCL), UK, in computer science
Verified item 3: Chinchilla: proposed the "Chinchilla Scaling Laws"..."the NeurIPS 2022 Best Paper"
Verified item 4: Gemini series: pre-training lead for Gemini 1.0/2.0, reasoning technical lead for Gemini 2.5
Verified item 5: Background: personally recruited by Zuckerberg, one of the first key hires for Meta's AGI team
Remaining Items to Verify
Item to verify 1: One of the core developers of GPT-4, involved in pre-training and inference system design. Sources to check: the GPT-4 Technical Report (https://arxiv.org/abs/2303.08774) and the official contributors page (https://openai.com/contributions/gpt-4).
Item to verify 2: Led project: Gopher: a 280-billion-parameter language model that "for the first time halved the gap to human exam-level performance"
Item to verify 3: Proposed the "compression is intelligence" theory: that large-model training is essentially lossless compression of data, and the more efficient the compression, the more intelligent the model
Item to verify 4: Undergraduate: University of Bristol, UK; Master's: Carnegie Mellon University (CMU), USA
Item to verify 5: Position: core member of Meta Superintelligence Labs
Action Plan
Let me first review the official OpenAI GPT-4 contributors page to confirm whether Jack Rae was involved in developing GPT-4.
Core contributors
Christopher Berner Supercomputing lead
Greg Brockman Infrastructure lead
Trevor Cai Throughput lead
David Farhi Manager of optimization team
Chris Hesse Infrastructure usability co-lead
Shantanu Jain Infrastructure usability co-lead
Kyle Kosic Uptime and stability lead
Jakub Pachocki Overall lead, optimization lead
Alex Paino Architecture & data vice lead
Mikhail Pavlov Software correctness lead
Michael Petrov Hardware correctness lead
Nick Ryder Architecture & data lead
Szymon Sidor Optimization vice lead
Nikolas Tezak Execution lead
Phil Tillet Triton lead
Amin Tootoonchian Model distribution, systems & networking lead
Qiming Yuan Dataset sourcing and processing lead
Wojciech Zaremba Manager of dataset team
Compute cluster scaling
Christopher Berner, Oleg Boiko, Andrew Cann, Ben Chess, Christian Gibson, Mateusz Litwin, Emy Parparita, Henri Roussez, Eric Sigler, Akila Welihinda
Data
Sandhini Agarwal, Suchir Balaji, Mo Bavarian, Che Chang, Sheila Dunning, Leo Gao, Jonathan Gordon, Peter Hoeschele, Shawn Jain, Shantanu Jain, Roger Jiang, Heewoo Jun, Łukasz Kaiser, Nitish Shirish Keskar, Jong Wook Kim, Aris Konstantinidis, Chak Li, Todor Markov, Bianca Martin, David Mély, Oleg Murk, Hyeonwoo Noh, Long Ouyang, Alex Paino, Vitchyr Pong, Alec Radford, Nick Ryder, John Schulman, Daniel Selsam, Ian Sohl, Chelsea Voss, Lilian Weng, Clemens Winter, Tao Xu, Qiming Yuan, Wojciech Zaremba
Distributed training infrastructure
Greg Brockman, Trevor Cai, Chris Hesse, Shantanu Jain, Yongjik Kim, Kyle Kosic, Mateusz Litwin, Jakub Pachocki, Mikhail Pavlov, Szymon Sidor, Nikolas Tezak, Madeleine Thompson, Amin Tootoonchian, Qiming Yuan
Hardware correctness
Greg Brockman, Shantanu Jain, Kyle Kosic, Michael Petrov, Nikolas Tezak, Amin Tootoonchian, Chelsea Voss, Qiming Yuan
Optimization & architecture
Igor Babuschkin, Mo Bavarian, Adrien Ecoffet, David Farhi, Jesse Han, Ingmar Kanitscheider, Daniel Levy, Jakub Pachocki, Alex Paino, Mikhail Pavlov, Nick Ryder, Szymon Sidor, Jie Tang, Jerry Tworek, Tao Xu
Training run babysitting
Suchir Balaji, Mo Bavarian, Greg Brockman, Trevor Cai, Chris Hesse, Shantanu Jain, Roger Jiang, Yongjik Kim, Kyle Kosic, Mateusz Litwin, Jakub Pachocki, Alex Paino, Mikhail Pavlov, Michael Petrov, Nick Ryder, Szymon Sidor, Nikolas Tezak, Madeleine Thompson, Phil Tillet, Amin Tootoonchian, Chelsea Voss, Ben Wang, Tao Xu, Qiming Yuan
Core contributors
Trevor Cai Execution lead
Mark Chen Vision team co-lead, Deployment lead
Casey Chu Initial prototype lead
Chris Hesse Data load balancing & developer tooling lead
Shengli Hu Vision Safety Evaluations lead
Yongjik Kim GPU performance lead
Jamie Kiros Overall vision co-lead, deployment research & evaluation lead
Daniel Levy Overall vision co-lead, optimization lead
Christine McLeavey Vision team lead
David Mély Data lead
Hyeonwoo Noh Overall vision co-lead, research lead
Mikhail Pavlov Scaling engineering lead
Raul Puri Overall vision co-lead, engineering lead
Amin Tootoonchian Model distribution, systems & networking lead
Architecture research
Casey Chu, Jamie Kiros, Christine McLeavey, Hyeonwoo Noh, Raul Puri, Alec Radford, Aditya Ramesh
Compute cluster scaling
Andrew Cann, Rory Carmichael, Christian Gibson, Henri Roussez, Akila Welihinda
Distributed training infrastructure
Trevor Cai, Yunxing Dai, Chris Hesse, Brandon Houghton, Yongjik Kim, Łukasz Kondraciuk, Hyeonwoo Noh, Mikhail Pavlov, Raul Puri, Nikolas Tezak, Amin Tootoonchian, Tianhao Zheng
Hardware correctness
Oleg Boiko, Trevor Cai, Michael Petrov, Alethea Power
Data
Jong Wook Kim, David Mély, Reiichiro Nakano, Hyeonwoo Noh, Long Ouyang, Raul Puri, Pranav Shyam, Tao Xu
Alignment Data
Long Ouyang
Training run babysitting
Trevor Cai, Kyle Kosic, Daniel Levy, David Mély, Reiichiro Nakano, Hyeonwoo Noh, Mikhail Pavlov, Raul Puri, Amin Tootoonchian
Deployment & post-training
Ilge Akkaya, Mark Chen, Jamie Kiros, Rachel Lim, Reiichiro Nakano, Raul Puri, Jiayi Weng
Core contributors
Greg Brockman Core infrastructure author
Arka Dhar Human data product manager
Liam Fedus Data flywheel lead
Tarun Gogineni Model creativity
Rapha Gontijo-Lopes Synthetic data
Joshua Gross Data collection engineering co-lead
Johannes Heidecke Refusals & model safety co-lead
Joost Huizinga Initial fine-tuning derisking
Teddy Lee Human data product manager
Jan Leike Alignment co-lead
Ryan Lowe Alignment co-lead
Luke Metz Infrastructure lead, ChatML format lead
Long Ouyang IF data collection lead
John Schulman Overall lead
Jerry Tworek Code lead
Carroll Wainwright IF data infrastructure lead
Jonathan Ward Data collection engineering co-lead
Jiayi Weng RL Infrastructure author
Sarah Yoo Human data operations manager
Wojciech Zaremba Human data lead
Chong Zhang Refusals & model safety co-lead
Shengjia Zhao Reward model lead
Barret Zoph Overall training lead
Dataset contributions
Diogo Almeida, Mo Bavarian, Juan Felipe Cerón Uribe, Tyna Eloundou, Liam Fedus, Tarun Gogineni, Rapha Gontijo-Lopes, Jonathan Gordon, Joost Huizinga, Shawn Jain, Roger Jiang, Łukasz Kaiser, Christina Kim, Jan Leike, Chak Li, Stephanie Lin, Ryan Lowe, Jacob Menick, Luke Metz, Pamela Mishkin, Tong Mu, Oleg Murk, Ashvin Nair, Long Ouyang, Alex Passos, Michael (Rai) Pokorny, Vitchyr Pong, Shibani Santurkar, Daniel Selsam, Sarah Shoker,, Carroll Wainwright, Matt Wiethoff, Jeff Wu, Kai Xiao, Kevin Yu, Marvin Zhang, Chong Zhang, William Zhuk, Barret Zoph
Data infrastructure
Irwan Bello, Lenny Bogdonoff, Juan Felipe Cerón Uribe, Joshua Gross, Shawn Jain, Haozhun Jin, Christina Kim, Aris Konstantinidis, Teddy Lee, David Medina, Jacob Menick, Luke Metz, Ashvin Nair,Long Ouyang, Michael (Rai) Pokorny, Vitchyr Pong, John Schulman, Jonathan Ward, Jiayi Weng, Matt Wiethoff, Sarah Yoo, Kevin Yu, Wojciech Zaremba, William Zhuk, Barret Zoph
ChatML format
Ilge Akkaya, Christina Kim, Chak Li, Rachel Lim, Jacob Menick, Luke Metz, Andrey Mishchenko, Vitchyr Pong, John Schulman, Carroll Wainwright, Barret Zoph
Model safety
Josh Achiam, Steven Adler, Juan Felipe Cerón Uribe, Hyung Won Chung, Tyna Eloundou, Rapha Gontijo-Lopes, Shixiang Shane Gu, Johannes Heidecke, Joost Huizinga, Teddy Lee, Jan Leike, Stephanie Lin, Ryan Lowe, Todor Markov, Luke Metz, Tong Mu, Shibani Santurkar, John Schulman, Andrea Vallone, Carroll Wainwright, Jason Wei, Lilian Weng, Kai Xiao, Chong Zhang, Marvin Zhang, Barret Zoph
Refusals
Juan Felipe Cerón Uribe, Tyna Eloundou, Johannes Heidecke, Joost Huizinga, Jan Leike, Stephanie Lin, Ryan Lowe, Pamela Mishkin, Tong Mu, Carroll Wainwright, Lilian Weng, Kai Xiao, Chong Zhang, Barret Zoph
Foundational RLHF and InstructGPT work
Diogo Almeida, Joost Huizinga, Roger Jiang, Jan Leike, Stephanie Lin, Ryan Lowe, Pamela Mishkin, Dan Mossing, Long Ouyang, Katarina Slama, Carroll Wainwright, Jeff Wu, Kai Xiao, Marvin Zhang
Flagship training runs
Greg Brockman, Liam Fedus, Johannes Heidecke, Joost Huizinga, Roger Jiang, Kyle Kosic, Luke Metz, Ashvin Nair, Jiayi Weng, Chong Zhang, Shengjia Zhao, Barret Zoph
Code capability
Ilge Akkaya, Mo Bavarian, Jonathan Gordon, Shawn Jain, Haozhun Jin, Teddy Lee, Chak Li, Oleg Murk, Ashvin Nair, Vitchyr Pong, Benjamin Sokolowsky, Jerry Tworek, Matt Wiethoff, Sarah Yoo, Kevin Yu, Wojciech Zaremba, William Zhuk
Core contributors
Sandhini Agarwal System Card co-lead
Lama Ahmad Expert red teaming & adversarial testing program lead
Mo Bavarian Capability prediction co-lead
Tyna Eloundou Safety evaluations co-lead
Andrew Kondrich OpenAI Evals open-sourcing co-lead
Gretchen Krueger System Card co-lead
Michael Lampe Privacy and PII evaluations lead
Pamela Mishkin Economic impact & overreliance evaluations lead
Benjamin Sokolowsky Capability prediction co-lead
Jack Rae Research benchmark execution lead
Chelsea Voss Eval execution lead
Alvin Wang OpenAI Evals lead
Kai Xiao Safety evaluations co-lead
Marvin Zhang OpenAI Evals open-sourcing co-lead
OpenAI Evals library
Shixiang Shane Gu, Angela Jiang, Logan Kilpatrick, Andrew Kondrich, Pamela Mishkin, Jakub Pachocki, Ted Sanders, Jessica Shieh, Alvin Wang, Marvin Zhang
Model-graded evaluation infrastructure
Liam Fedus, Rapha Gontijo-Lopes, Shixiang Shane Gu, Andrew Kondrich, Michael (Rai) Pokorny, Wojciech Zaremba, Chong Zhang, Marvin Zhang, Shengjia Zhao, Barret Zoph
Acceleration forecasting
Alan Hickey, Daniel Kokotajlo, Cullen O’Keefe, Sarah Shoker
ChatGPT evaluations
Juan Felipe Cerón Uribe, Hyung Won Chung, Rapha Gontijo-Lopes, Liam Fedus, Luke Metz, Michael Rai Pokorny, Jason Wei, Shengjia Zhao, Barret Zoph
Capability evaluations
Sully Chen, Tyna Eloundou, Shengli Hu, Roger Jiang, Jamie Kiros, Teddy Lee, Scott Mayer McKinney, Jakub Pachocki, Alex Paino, Giambattista Parascandolo, Boris Power, Raul Puri, Jack Rae, Nick Ryder, Ted Sanders, Szymon Sidor, Benjamin Sokolowsky, Chelsea Voss, Alvin Wang, Rowan Zellers, Juntang Zhuang
Coding evaluations
Ilge Akkaya, Mo Bavarian, Jonathan Gordon, Shawn Jain, Chak Li, Oleg Murk, Vitchyr Pong, Benjamin Sokolowsky, Jerry Tworek, Kevin Yu, Wojciech Zaremba
Real-world use case evaluations
Andrew Kondrich, Joe Palermo, Boris Power, Ted Sanders
Contamination investigations
Adrien Ecoffet, Roger Jiang, Ingmar Kanitscheider, Scott Mayer McKinney, Alex Paino, Giambattista Parascandolo, Jack Rae, Qiming Yuan
Instruction following and API evals
Diogo Almeida, Carroll Wainwright, Marvin Zhang
Novel capability discovery
Filipe de Avila Belbute Peres, Kevin Button, Fotis Chantzis, Mike Heaton, Wade Hickey, Xin Hu, Andrew Kondrich, Matt Knight, Andrew Mayne, Jake McNeil, Vinnie Monaco, Joe Palermo, Joel Parish, Boris Power, Bob Rotsted, Ted Sanders
Vision evaluations
Shixiang Shane Gu, Shengli Hu, Jamie Kiros, Hyeonwoo Noh, Raul Puri, Rowan Zellers
Economic impact evaluation
Tyna Eloundou, Sam Manning, Aalok Mehta, Pamela Mishkin
Non-proliferation, international humanitarian law & national security red teaming
Sarah Shoker
Overreliance analysis
Miles Brundage, Michael Lampe, Pamela Mishkin
Privacy and PII evaluations
Michael Lampe, Vinnie Monaco, Ashley Pantuliano
Safety and policy evaluations
Josh Achiam, Sandhini Agarwal, Lama Ahmad, Jeff Belgum, Tyna Eloundou, Johannes Heidecke, Shengli Hu, Joost Huizinga, Jamie Kiros, Gretchen Krueger, Michael Lampe, Stephanie Lin, Ryan Lowe, Todor Markov, Vinnie Monaco, Tong Mu, Raul Puri, Girish Sastry, Andrea Vallone, Carroll Wainwright, CJ Weinmann, Lilian Weng, Kai Xiao, Chong Zhang
OpenAI adversarial testers
Josh Achiam, Steven Adler, Lama Ahmad, Shyamal Anadkat, Red Avila, Gabriel Bernadett-Shapiro, Anna-Luisa Brakman, Tim Brooks, Miles Brundage, Chelsea Carlson, Derek Chen, Hyung Won Chung, Jeremiah Currier, Daniel Kokotajlo, David Dohan, Adrien Ecoffet, Juston Forte, Vik Goel, Ryan Greene, Johannes Heidecke, Alan Hickey, Shengli Hu, Joost Huizinga, Janko, Tomer Kaftan, Ali Kamali, Nitish Shirish Keskar, Tabarak Khan, Hendrik Kirchner, Daniel Kokotajlo, Gretchen Krueger, Michael Lampe, Teddy Lee, Molly Lin, Ryan Lowe, Todor Markov, Jake McNeil, Pamela Mishkin, Vinnie Monaco, Daniel Mossing, Tong Mu, Oleg Murk, Cullen O’Keefe, Joe Palermo, Giambattista Parascandolo, Joel Parish, Boris Power, Alethea Power, Cameron Raymond, Francis Real, Bob Rotsted, Mario Salterelli, Sam Wolrich, Ted Sanders, Girish Sasty, Sarah Shoker, Shyamal Anadkat, Yang Song, Natalie Staudacher, Madeleine Thompson, Elizabeth Tseng, Chelsea Voss, Jason Wei, Chong Zhang
System card & broader impacts analysis
Steven Adler, Sandhini Agarwal, Lama Ahmad, Janko Altenschmidt, Jeff Belgum, Gabriel Bernadett-Shapiro, Miles Brundage, Derek Chen, Tyna Eloundou, Liam Fedus, Leo Gao, Vik Goel, Johannes Heidecke, Alan Hickey, Shengli Hu, Joost Huizinga, Daniel Kokotajlo, Gretchen Krueger, Michael Lampe, Jade Leung, Stephanie Lin, Ryan Lowe, Kim Malfacini, Todor Markov, Bianca Martin, Aalok Mehta, Pamela Mishkin, Tong Mu, Richard Ngo, Cullen O’Keefe, Joel Parish, Rai Pokorny, Bob Rotsted, Girish Sastry, Sarah Shoker, Andrea Vallone, Carroll Wainwright, CJ Weinmann, Lilian Weng, Dave Willner, Kai Xiao, Chong Zhang
Core contributors
Steven Adler Early stage program management lead
Sandhini Agarwal Launch safety lead
Derek Chen Monitoring & response lead
Atty Eleti GPT‑4 API co-lead
Joanne Jang GPT‑4 product co-lead
Angela Jiang GPT‑4 product co-lead
Tomer Kaftan Inference infrastructure & deployment lead
Rachel Lim GPT‑4 API co-lead
Kim Malfacini Usage policy lead
Bianca Martin Release program management lead
Evan Morikawa Engineering lead
Henrique Ponde de Oliveira Pinto Inference workflow lead
Heather Schmidt GPT‑4 infrastructure management
Maddie Simens Design lead
Felipe Petroski Such Inference optimization & reliability lead
Andrea Vallone Detection & refusals policy lead
Lilian Weng Applied research lead
Dave Willner Trust & safety lead
Michael Wu Inference research lead
Inference research
Paul Baltescu, Scott Gray, Yuchen He, Arvind Neelakantan, Michael Wu
GPT‑4 API & ChatML deployment
Greg Brockman, Brooke Chan, Chester Cho, Atty Eleti, Rachel Lim, Andrew Peng, Michelle Pokrass, Sherwin Wu
GPT‑4 web experience
Valerie Balcom, Lenny Bogdonoff, Jason Chen, Dave Cummings, Noah Deutsch, Mike Heaton, Paul McMillan, Rajeev Nayak, Joel Parish, Adam Perelman, Eric Sigler, Nick Turley, Arun Vijayvergiya, Chelsea Voss
Inference infrastructure
Brooke Chan, Scott Gray, Chris Hallacy, Kenny Hsu, Tomer Kaftan, Rachel Lim, Henrique Ponde de Oliveira Pinto, Raul Puri, Heather Schmidt, Felipe Petroski Such
Reliability engineering
Haiming Bao, Madelaine Boyd, Ben Chess, Damien Deville, Yufei Guo, Vishal Kuo, Ikai Lan, Michelle Pokrass, Carl Ross, David Schnurr, Jordan Sitkin, Felipe Petroski Such
Trust & safety engineering
Jeff Belgum, Madelaine Boyd, Vik Goel
Trust & safety monitoring and response
Janko Altenschmidt, Anna-Luisa Brakman, Derek Chen, Florencia Leoni Aleman, Molly Lin, Cameron Raymond, CJ Weinmann, Dave Willner, Samuel Wolrich
Trust & safety policy
Rosie Campbell, Kim Malfacini, Andrea Vallone, Dave Willner
Deployment compute
Peter Hoeschele, Evan Morikawa
Product management
Jeff Harris, Joanne Jang, Angela Jiang
Sam Altman, Katie Mayer, Bob McGrew, Mira Murati, Ilya Sutskever, Peter Welinder
Blog post & paper content
Sandhini Agarwal, Greg Brockman, Miles Brundage, Adrien Ecoffet, Tyna Eloundou, David Farhi, Johannes Heidecke, Shengli Hu, Joost Huizinga, Roger Jiang, Gretchen Krueger, Jan Leike, Daniel Levy, Stephanie Lin, Ryan Lowe, Tong Mu, Hyeonwoo Noh, Jakub Pachocki, Jack Rae, Kendra Rimbach, Shibani Santurkar, Szymon Sidor, Benjamin Sokolowsky, Jie Tang, Chelsea Voss, Kai Xiao, Rowan Zellers, Chong Zhang, Marvin Zhang
Communications
Ruby Chen, Cory Decareaux, Thomas Degry, Steve Dowling, Niko Felix, Elie Georges, Anna Makanju, Andrew Mayne, Aalok Mehta, Elizabeth Proehl, Kendra Rimbach, Natalie Summers, Justin Jay Wang, Hannah Wong
Compute allocation support
Theresa Lopez, Elizabeth Tseng
Contracting, revenue, pricing & finance support
Brooke Chan, Denny Jin, Billie Jonn, Patricia Lue, Kyla Sheppard, Lauren Workman
Launch partners & product operations
Filipe de Avila Belbute Peres, Brittany Carey, Simón Posada Fishman, Isabella Fulford, Teddy Lee, Yaniv Markovski, Tolly Powell, Toki Sherbakov, Jessica Shieh, Natalie Staudacher, Preston Tuggle
Legal
Jake Berdine, Che Chang, Sheila Dunning, Ashley Pantuliano
Security & privacy engineering
Kevin Button, Fotis Chantzis, Wade Hickey, Xin Hu, Shino Jomoto, Matt Knight, Jake McNeil, Vinnie Monaco, Joel Parish, Bob Rotsted
System administration & on-call support
Morgan Grafstein, Francis Real, Mario Saltarelli
Authorship & credit attribution
David Farhi
We also acknowledge and thank every OpenAI team member not explicitly mentioned above, including the amazing people on the executive assistant, finance, go to market, human resources, legal, operations and recruiting teams. From hiring everyone in the company, to making sure we have an amazing office space, to building the administrative, HR, legal, and financial structures that allow us to do our best work, everyone at OpenAI has contributed to GPT‑4.
We thank Microsoft for their partnership, especially Microsoft Azure for supporting model training with infrastructure design and management, and the Microsoft Bing team and Microsoft’s safety teams for their partnership on safe deployment.
We are grateful to our expert adversarial testers and red teamers who helped test our models at early stages of development and informed our risk assessments as well as the system card. Participation in this red teaming process is not an endorsement of the deployment plans of OpenAI or OpenAI’s policies: Steven Basart, Sophie Duba, Cèsar Ferri, Heather Frase, Gavin Hartnett, Jake J. Hecla, Dan Hendrycks, Jose Hernandez-Orallo, Alice Hunsberger, Rajiv W. Jain, Boru Gollo Jattani, Lauren Kahn, Dan Kaszeta, Sara Kingsley, Noam Kolt, Nathan Labenz, Eric Liddick, Andrew J. Lohn, Andrew MacPherson, Sam Manning, Mantas Mazeika, Anna Mills, Yael Moros, Jimin Mun, Aviv Ovadya, Roya Pakzad, Yifan Peng, Ciel Qi, Alex Rosenblatt, Paul Röttger, Maarten Sap, Wout Schellaert, George Shih, Muhammad Shoker, Melanie Subbiah, Bryan West, Andrew D. White, Anna Katariina Wisakanto, Akhila Yerukola, Lexin Zhou, Xuhui Zhou.
Contributors listed in alphabetized order.
I need to continue looking for Jack Rae's role and contributions in the Gopher project, particularly the claim that it "halved the gap to human exam-level performance".
Search
Jack Rae Gopher DeepMind halves gap human expert performance
I need to find the original Gopher paper to confirm Jack Rae's specific role and the claim about "halving the gap to human-level performance".
Source: https://storage.googleapis.com/deepmind-media/research/language-research/Training%20Gopher.pdf (published 2021-12-08)
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu and Geoffrey Irving
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales — from models with tens of millions of parameters up to a 280 billion parameter model called
Gopher . These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model’s behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.
Keywords: Natural Language Processing, Language Models, Deep Learning
Contents
1 Introduction
2 Background
3 Method (3.1 Models; 3.2 Training; 3.3 Infrastructure; 3.4 Training Dataset)
4 Results (4.1 Task Selection; 4.2 Comparisons with State of the Art; 4.3 Performance Improvements with Scale)
5 Toxicity and Bias Analysis (5.1 Toxicity; 5.2 Distributional Bias)
6 Dialogue (6.1 Prompting For Dialogue; 6.2 Fine-tuning for Dialogue; 6.3 Dialogue & Toxicity)
7 Discussion (7.1 Towards Efficient Architectures; 7.2 Challenges in Toxicity and Bias; 7.3 Safety benefits and safety risks)
8 Conclusion
9 Acknowledgements
10 Contributions
A MassiveText (A.1 Dataset Pipeline; A.2 Dataset Analysis; A.3 Dataset Ablations; A.4 Text normalisation; A.5 MassiveText Datasheet)
B Gopher Model Card
C Lessons Learned (C.1 Adafactor; C.2 Lower-Precision Training with bfloat16)
D Results (D.1 Overview; D.2 Pile; D.3 Language Modelling; D.4 Filtering Test-Set Documents; D.5 Scaling Curves; D.6 Scaling Context Length; D.7 MMLU; D.8 BIG-bench; D.9 TriviaQA & NaturalQuestions; D.10 TruthfulQA; D.11 Reading Comprehension: RACE; D.12 Fact-Checking: FEVER & MultiFC; D.13 Common Sense: PIQA, WinoGrande, SocialIQA, HellaSwag)
E Toxicity and Bias Analysis (E.1 Toxic Generations; E.2 Classifying Toxicity; E.3 Distributional Bias)
F Compute Usage
G Reducing Inference and Training Costs (G.1 Efficient Fine-tuning; G.2 Reducing Inference Costs; G.3 Reducing Training Costs; G.4 Future Work for Efficient Training)
H Dialogue-Prompted Gopher Details (H.1 Construction; H.2 Dialogue Dataset Filtering; H.3 Comparison Methodology; H.4 RTP in a Dialogue Setting; H.5 Selected Transcripts)
Corresponding authors: jwrae@deepmind.com, geoffreyi@deepmind.com
text statistics (Dewey, 1923) relating model complexity to improved text compression alongside more realistic text generation.[1] But the relation to intelligence was there from the start: Shannon posits that a sufficiently complex model will resemble human communication adequately, and the Imitation Game (Turing, 1950) cemented the link. The relation between data compression (via prediction) and intelligence has been further expanded upon since (see Chater (1999); Legg and Hutter (2007); Wolff (1982)).

A key driver towards better language models has been modern computing. From their pen-and-paper origins, language models have transformed in capacity and predictive power by the exponential rise in compute (Moore et al., 1965). In the 1990s and 2000s, n-gram models saw increases in scale and better smoothing approaches (Ney et al., 1994), including a 300 billion n-gram model trained on two trillion tokens of text (Brants et al., 2007). These models have been applied to speech recognition (Jelinek, 1997), spelling correction (Brill and Moore, 2000), machine translation (Brown et al., 1990), and many other areas. However n-gram models become statistically and computationally inefficient as the context length is increased, which limits the richness of language they can model.

In the past two decades language models have progressed to neural networks that capture the structure of language implicitly (Bengio et al., 2003; Graves, 2013; Jozefowicz et al., 2016; Mikolov et al., 2010; Radford et al., 2019). Progress has been driven by both scale and network architecture (Bahdanau et al., 2014; Hochreiter and Schmidhuber, 1997; Vaswani et al., 2017). Kaplan et al. (2020) found power laws relating cross entropy loss to model and training dataset size for Transformer models. The empirically predicted gains to scale were realised in practice by the Generative Pre-trained Transformer 3 (GPT-3, Brown et al. (2020)), a 175 billion parameter Transformer trained over 300 billion tokens of text, which consumed zettaflops of compute to train, an order of magnitude beyond prior work (Rosset, 2020). GPT-3 demonstrated unprecedented generation quality alongside generalist capabilities across many Natural Language Processing (NLP) tasks, notably when prompted with examples (termed few-shot prompting).

In this paper we describe a protocol for training a state-of-the-art large language model and present a 280 billion parameter model called Gopher. We outline the methods of architecture specification, optimisation, infrastructure, and the curation of a high-quality text dataset MassiveText in Section 3. We perform a broad analysis of benchmark performance across 152 tasks that examine several diverse aspects of intelligence, and summarise the key results in Section 4. We see that Gopher lifts the performance over current state-of-the-art language models across roughly 81% of tasks containing comparable results, notably in knowledge-intensive domains such as fact checking and general knowledge.

As harmful content occurs both in Gopher's training set and in many potential downstream applications, we examine model toxicity and bias in Section 5 with a focus on how scale influences these properties. We find larger models are more likely to generate toxic responses when provided with toxic prompts, but they can also more accurately classify toxicity.
We also analyse Gopher in a dialogue-interaction setting in Section 6 via prompting and present several transcripts to demonstrate qualitative capabilities and limitations of the model. Finally, we discuss the ethical and safe application of these models including which types of undesirable behaviour to mitigate before and after training in Section 7. We discuss application-driven safety and the potential for language models to accelerate research towards safer intelligent technology.
[1] A sample from Shannon's word-pair model: "the head and in frontal attack on an english writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected."
∏_{i=1}^{n} P(X_i | X_{<i}). This is also known as autoregressive sequence modelling, because at each time-step the future (in this case, future token) is predicted based upon the past context. Whilst there are other objectives towards modelling a sequence, such as modelling masked tokens given bi-directional context (Devlin et al., 2019; Mikolov et al., 2013) and modelling all permutations of the sequence (Yang et al., 2019) we focus on autoregressive modelling due to its strong performance and simplicity. We shall refer to language models hereon as the function approximators to perform next-token prediction. A class of neural networks known as Transformers (Vaswani et al., 2017) have demonstrated state-of-the-art language model performance in recent years (Dai et al., 2019; Radford et al., 2018, 2019) and this is the architecture we focus on in this paper.

There has been a trend of scaling the combination of training data, model size (measured in parameters) and training computation to obtain models with improved performance across academic and industrial benchmarks. Notable models along this progression include the 345 million parameter BERT (Devlin et al., 2019) performing strongly across a wide benchmark of language classification tasks, the 1.5 billion parameter GPT-2 (Radford et al., 2018) and 8.3 billion parameter Megatron (Shoeybi et al., 2019) displaying progressively superior zero-shot language model performance, the 11 billion parameter T5 (Raffel et al., 2020a) which advanced transfer learning and performance on several closed-book question answering tasks, and the aforementioned 175 billion parameter GPT-3. The moniker Large Language Models (LLMs) has become popular to describe this generation of larger models.

Since GPT-3 there has been a 178B parameter Transformer language model Jurassic-1 (Lieber et al., 2021) which uses a diverse training set and a larger tokenizer vocabulary size, along with an announced 530B Megatron-Turing NLG (Kharya and Alvi, 2021) which trains on a released dataset (The Pile, Gao et al. (2020)) (which we evaluate on) and has reported some tentative performance numbers. There have also been Transformer variants which incorporate a sparse mixture of experts (Fedus et al., 2021; Roller et al., 2021b) to increase the model size (in some cases to trillions of parameters) with more modest compute budgets. Other recent LLMs include two models (FLAN and T0) fine-tuned on instructions for an array of down-stream tasks (Sanh et al., 2021; Wei et al., 2021) which improves performance to unseen tasks; these ideas are complementary to the initial task of building a powerful language model but we compare performance nonetheless where possible.
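As an aside connecting the autoregressive factorization above with the compression framing from the paper's introduction (and with the "compression is intelligence" claim under verification): the negative log-likelihood a model assigns to a sequence, in bits, is roughly the length that sequence would take under arithmetic coding with that model, so better next-token prediction means better lossless compression. A minimal sketch with made-up probabilities, not taken from the paper:

```python
import numpy as np

def sequence_bits(token_log_probs):
    """-sum_i log2 P(x_i | x_<i): the model's negative log-likelihood in bits,
    which is approximately the arithmetic-coded length of the sequence."""
    return -np.sum(token_log_probs) / np.log(2.0)

# Hypothetical per-token probabilities P(x_i | x_<i) from some language model.
probs = np.array([0.20, 0.05, 0.60, 0.33])
bits = sequence_bits(np.log(probs))
print(f"{bits:.2f} bits total, {bits / probs.size:.2f} bits per token")
```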
3. Method
3.1. Models
In this paper we present results on six Transformer language models ranging from 44 million to 280 billion parameters, with the architectural details displayed in Table 1. We refer to the largest as
Gopher and the entire set of models as the Gopher family.
| Model | Layers | Number Heads | Key/Value Size | d_model | Max LR | Batch Size |
|---|---|---|---|---|---|---|
| 44M | 8 | 16 | 32 | 512 | 6 × 10⁻⁴ | 0.25M |
| 117M | 12 | 12 | 64 | 768 | 6 × 10⁻⁴ | 0.25M |
| 417M | 12 | 12 | 128 | 1,536 | 2 × 10⁻⁴ | 0.25M |
| 1.4B | 24 | 16 | 128 | 2,048 | 2 × 10⁻⁴ | 0.25M |
| 7.1B | 32 | 32 | 128 | 4,096 | 1.2 × 10⁻⁴ | 2M |
| Gopher 280B | 80 | 128 | 128 | 16,384 | 4 × 10⁻⁵ | 3M → 6M |

Table 1 | Model architecture details. For each model, we list the number of layers, the key/value size, the bottleneck activation size d_model, the maximum learning rate, and the batch size. The feed-forward size is always 4 × d_model.

We use the autoregressive Transformer architecture detailed in Radford et al. (2019) with two modifications: we use RMSNorm (Zhang and Sennrich, 2019) instead of LayerNorm (Ba et al., 2016), and we use the relative positional encoding scheme from Dai et al. (2019) rather than absolute positional encodings. Relative encodings permit us to evaluate on longer sequences than we trained on, which improves the modelling of articles and books as shown in Section D.6. We tokenize the text using SentencePiece (Kudo and Richardson, 2018) with a vocabulary of 32,000 and use a byte-level backoff to support open-vocabulary modelling. The Gopher model card (Mitchell et al., 2019) is included in Appendix B.
3.2. Training
We train all models for 300 billion tokens with a 2048 token context window, using the Adam (Kingma and Ba, 2014) optimiser. We warm-up the learning rate from 10⁻⁷ to the maximum learning rate over the first 1500 steps, and then decay it 10× using a cosine schedule. As we increase model size, we decrease the maximum learning rate and increase the number of tokens in each batch, as shown in Table 1. Furthermore, we increase Gopher's batch size from three to six million tokens per batch during training. We clip gradients based on the global gradient norm using a clipping value of 1. However, for the 7.1B model and for Gopher we reduce this to 0.25 for improved stability. We incorporate the bfloat16 numerical format to reduce memory and increase training throughput. Models smaller than 7.1B are trained with mixed precision float32 parameters and bfloat16
activations (Micikevicius et al., 2018), while 7.1B and 280B use bfloat16 activations and parameters.
bfloat16 parameters are updated using stochastic rounding to ...
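上文描述的学习率日程是"前 1500 步从 10⁻⁷ 预热到最大学习率,随后按余弦衰减 10 倍"。下面给出一个示意性的实现草图,其中 warmup_steps=1500、起始学习率 1e-7、衰减 10 倍与论文描述一致;total_steps 的取值与余弦的具体写法则是本示意的假设:

```python
import math

def gopher_style_lr(step: int, max_lr: float, total_steps: int,
                    warmup_steps: int = 1500, start_lr: float = 1e-7,
                    final_ratio: float = 0.1) -> float:
    """示意性的 "线性预热 + 余弦衰减到 1/10" 学习率日程。"""
    if step < warmup_steps:
        # 前 warmup_steps 步从 start_lr 线性升到最大学习率
        return start_lr + (max_lr - start_lr) * step / warmup_steps
    # 之后按余弦从 max_lr 衰减到 max_lr * final_ratio
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return max_lr * (final_ratio + (1.0 - final_ratio) * cosine)

# 用法示意: 以 280B 模型的 max_lr = 4e-5 为例 (total_steps 为假设值)
print(gopher_style_lr(step=0,       max_lr=4e-5, total_steps=100_000))  # ≈ 1e-7
print(gopher_style_lr(step=1500,    max_lr=4e-5, total_steps=100_000))  # ≈ 4e-5
print(gopher_style_lr(step=100_000, max_lr=4e-5, total_steps=100_000))  # ≈ 4e-6
```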
我需要查找更多关于Gopher在阅读理解方面的具体成就,特别是关于"将人类考试水平差距缩小一半"的说法。
Search
Gopher DeepMind almost halves the accuracy gap from GPT-3 to human expert performance
已核查结论清单
已核查内容1:Jack Rae 的职业经历可以分为以下几个阶段:...DeepMind(约2015–2023)...OpenAI(2023–2024)...Google DeepMind(2024–2025)...Meta(2025年6月起)
已核查内容2:博士:英国伦敦大学学院(UCL),计算机科学方向
已核查内容3:主导项目:Gopher:2800亿参数的语言模型,首次将人类考试水平差距缩小一半
已核查内容4:Chinchilla:提出“Chinchilla Scaling Laws”...2022年NeurIPS最佳论文
已核查内容5:GPT-4 核心开发者之一,参与预训练与推理系统设计
已核查内容6:Gemini 系列:Gemini 1.0/2.0 的预训练负责人,Gemini 2.5 的推理技术负责人
已核查内容7:背景:被扎克伯格亲自挖角,成为Meta组建AGI团队的首批关键人物之一
待补充核查清单
待核查内容1:提出“压缩即智能”理论:认为大模型训练本质是对数据的无损压缩,压缩越高效,模型越智能
待核查内容2:本科:英国布里斯托大学(University of Bristol);硕士:美国卡内基梅隆大学(CMU)
待核查内容3:职位:超级智能实验室(Meta Superintelligence Labs)核心成员
行动计划
让我首先搜索关于"压缩即智能"理论的信息,以确定Jack Rae在这一理论中的具体贡献。
Search
compression is intelligence theory history Solomonoff Kolmogorov Schmidhuber Jack Rae
我需要继续搜索关于"压缩即智能"理论的更多信息,特别是关于Jack Rae在这一理论中的贡献。
Search
Jack Rae Compression is Intelligence Stanford MLSys Seminar YouTube
我需要继续搜索关于"压缩即智能"理论的历史渊源,特别是Solomonoff和Kolmogorov的贡献。
Search
Solomonoff Kolmogorov compression theory algorithmic information theory history
我需要查找更多关于Jack Rae的教育背景信息,以及Meta Superintelligence Labs的相关信息。
Search
Jack Rae education background Bristol University CMU LinkedIn
已核查结论清单
已核查内容1:Jack Rae 的职业经历可以分为以下几个阶段:...DeepMind(约2015–2023)...OpenAI(2023–2024)...Google DeepMind(2024–2025)...Meta(2025年6月起)
已核查内容2:本科:英国布里斯托大学(University of Bristol);硕士:美国卡内基梅隆大学(CMU);博士:英国伦敦大学学院(UCL),计算机科学方向
已核查内容3:主导项目:Gopher:2800亿参数的语言模型,首次将人类考试水平差距缩小一半
已核查内容4:Chinchilla:提出“Chinchilla Scaling Laws”...2022年NeurIPS最佳论文
已核查内容5:GPT-4 核心开发者之一,参与预训练与推理系统设计
已核查内容6:提出“压缩即智能”理论:认为大模型训练本质是对数据的无损压缩,压缩越高效,模型越智能
已核查内容7:Gemini 系列:Gemini 1.0/2.0 的预训练负责人,Gemini 2.5 的推理技术负责人
已核查内容8:背景:被扎克伯格亲自挖角,成为Meta组建AGI团队的首批关键人物之一
待补充核查清单
待核查内容1:职位:超级智能实验室(Meta Superintelligence Labs)核心成员
行动计划
让我搜索关于Meta Superintelligence Labs的官方信息。
Search
Meta Superintelligence Labs official announcement Zuckerberg
我需要继续查找关于Meta Superintelligence Labs的官方信息,特别是Mark Zuckerberg的官方声明。
Search
Mark Zuckerberg Meta Superintelligence Labs official announcement memo
我需要查找更多关于Jack Rae在Meta Superintelligence Labs中的具体角色的信息。
Search
Jack Rae Meta Superintelligence Labs role position
现在我已经收集了足够的信息,可以提供一个全面的核查结果。
本次核查共涉及10项主要陈述,其中6项属实或基本属实,2项部分属实,2项不属实或归因不当。不属实、归因不当的内容包括:Chinchilla论文获得"NeurIPS 2022最佳论文"的说法,以及将"压缩即智能"理论归为Jack Rae个人首创的说法。部分属实的内容主要涉及对其在Chinchilla项目中作者角色的夸大,以及对其在GPT-4开发中实际职责(官方记录为"研究基准测试执行负责人",而非预训练与推理系统设计)的不准确描述;关于Gopher的表述基本属实,但存在简化。此外,"Meta Superintelligence Labs"这一机构名称已得到官方确认,但Jack Rae在该实验室中的具体职位描述仍需更多官方信息佐证。
用户报告中关于Jack Rae职业经历的总体框架是准确的,他确实横跨了DeepMind、OpenAI和Meta三大顶尖AI机构。然而,报告中对其具体贡献和角色的描述存在一定程度的夸大或不准确,建议在引用相关信息时谨慎对待,特别是关于其在特定项目中的领导角色和理论贡献方面。
| 原文陈述 | 可靠性分析 | 证据 |
| --- | --- | --- |
本科:英国布里斯托大学(University of Bristol);硕士:美国卡内基梅隆大学(CMU);博士:英国伦敦大学学院(UCL),计算机科学方向 | 属实 多个独立信源证实Jack Rae的教育背景,包括UCL的官方知识库中的博士论文,以及其在OpenReview和MSN上的个人资料。 | 1. "I, Jack William Rae confirm that the work presented in this thesis is my own." https://discovery.ucl.ac.uk/10126220/1/Jack%20Rae%20-%20Final%20Thesis.pdf (未知日期) 2. "MS student University of Bristol (bristol.ac.uk) 2008 – 2017; MS student Carnegie Mellon University (cmu.edu) 2013 – 2014" https://openreview.net/profile?id=~Jack_William_Rae1 (未知日期) 3. "Education: Bachelor of Science (BS), University of Bristol, UK; Master of Science (MS), Carnegie Mellon University (CMU), USA." https://www.msn.com/en-in/money/news/who-are-the-11-ai-experts-hired-by-mark-zuckerberg-s-meta-fun-fact-all-of-them-are-immigrants/ar-AA1HT8Du (未知日期) |
DeepMind(约2015–2023)...OpenAI(2023–2024)...Google DeepMind(2024–2025)...Meta(2025年6月起) | 基本属实 多个独立信源证实Jack Rae的职业路径确实横跨了DeepMind、OpenAI、Google DeepMind(回归)和Meta,时间点与各方报道基本吻合。 | 1. "Research Engineer. DeepMind (google.com). 2014 – 2017." https://openreview.net/profile?id=~Jack_William_Rae1 (未知日期) 2. "Jack Rae 是 OpenAI 的团队负责人,主要研究大型语言模型和远程记忆。此前,他在 DeepMind 工作了 8 年,领导大型语言模型 (LLM) 研究组。" https://weibo.com/ttarticle/p/show?id=2309634884017727013209 (2023-03-27) 3. "最新爆料,谷歌DeepMind首席研究员Jack Rae已经确认转投。" https://m.36kr.com/p/3333286223030536 (2025-06-12) 4. "杰克·雷(Jack Rae):Gemini的预训练技术负责人,负责Gemini 2.5的推理工作。曾在DeepMind领导早期的Gopher和Chinchilla大语言模型项目。" https://m.weibo.cn/status/Pz1RNf6gX (2025-07-01) |
主导项目:Gopher:2800亿参数的语言模型,首次将人类考试水平差距缩小一半 | 基本属实,但表述存在简化 Jack W. Rae是Gopher论文的第一作者和通讯作者,确为项目主导者。论文中确实提到Gopher"几乎将从GPT-3到人类专家表现的准确性差距缩小了一半",但这一描述是与GPT-3对比,而非泛指所有人类考试。 | 1. "Jack W. Rae, Sebastian Borgeaud, Trevor Cai... Corresponding authors: jwrae@deepmind.com, geoffreyi@deepmind.com" https://storage.googleapis.com/deepmind-media/research/language-research/Training%20Gopher.pdf (2021-12-08) 2. "We find Gopher almost halves the accuracy gap from GPT-3 to human expert performance and exceeds forecaster expectations." https://storage.googleapis.com/deepmind-media/research/language-research/Training%20Gopher.pdf (2021-12-08) |
Chinchilla:提出"Chinchilla Scaling Laws" | 部分属实 Jack Rae 是 Chinchilla 论文《Training Compute-Optimal Large Language Models》的作者之一,该论文确实提出了关于计算最优训练的扩展法则。但将他个人描述为"提出者"不准确,该论文有22位作者,他是资深作者之一(倒数第三位),而非第一作者。 | "Authors: Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, ... Jack W. Rae, Oriol Vinyals, Laurent Sifre" https://arxiv.org/abs/2203.15556 (2022-03-29) |
2022年NeurIPS最佳论文 | 不属实 NeurIPS 2022 官方博客发布的获奖论文名单中,并不包含《Training Compute-Optimal Large Language Models》(Chinchilla论文)。 | "We are excited to announce the award-winning papers for NeurIPS 2022! ... Outstanding Papers: Is Out-of-distribution Detection Learnable? ..." https://blog.neurips.cc/2022/11/21/announcing-the-neurips-2022-awards/ (2022-11-21) |
Gemini 系列:Gemini 1.0/2.0 的预训练负责人,Gemini 2.5 的推理技术负责人 | 基本属实 多方信源,包括据称为扎克伯格的内部备忘录,均指认 Jack Rae 在 Gemini 项目中担任关键领导角色,负责预训练和推理技术。 | 1. "杰克·雷(Jack Rae):Gemini的预训练技术负责人,负责Gemini 2.5的推理工作。" https://m.weibo.cn/status/Pz1RNf6gX (2025-07-01) 2. "我们邀请到了 Google DeepMind 的首席科学家 Jack Rae,他也是 Gemini 推理项目的主要负责人之一。" https://m.sohu.com/a/864740570_122105141/?pvid=000115_3w_a (2025-02-27) |
GPT-4 核心开发者之一,参与预训练与推理系统设计 | 部分属实,但角色描述不准确 Jack Rae 确实是 GPT-4 的贡献者之一,但其官方认证的角色是"研究基准测试执行负责人"(Research benchmark execution lead),而非报告中所述的"参与预训练与推理系统设计"。 | "Core contributors ... Jack Rae Research benchmark execution lead" https://openai.com/contributions/gpt-4/ (未知日期) |
提出"压缩即智能"理论:认为大模型训练本质是对数据的无损压缩,压缩越高效,模型越智能 | 归因不当 将该理论归功于Jack Rae个人是错误的。"压缩即智能"的核心思想源于20世纪60年代由Ray Solomonoff、Andrey Kolmogorov等人独立提出的算法信息论。Jack Rae是这一经典理论在大型语言模型时代的重要倡导者和实践者,他通过在斯坦福大学的演讲等活动,有力地推广了"将语言建模视为压缩"的观点,但并非该理论的"提出者"。(语言建模与无损压缩的对应关系,可参见表后的概念性代码示意。) | 1. "Algorithmic complexity was introduced independently by Solomonoff (1964) Kolmogorov (1965) and Chaitin (1966)." https://arxiv.org/pdf/2504.18568 (2025-04-22) 2. "Ray was the first inventor of Algorithmic Information Theory which deals with the shortest effective description length of objects and is commonly designated by the term 'Kolmogorov complexity.'" https://www.mdpi.com/1999-4893/3/3/260 (2010-07-20) 3. "在最近OpenAI的学术分享会中,Jack Rae提出了一个重要的论断: 大语言模型的本质,其实是一个性能强大的数据无损压缩器。" https://tech.ifeng.com/c/8OsTZsn6mGZ (2023-04-10) 4. "2月28日,OpenAI 的核心研发人员 Jack Rae 在参加 Stanford MLSys Seminar 的访谈时进行了一个名为 Compression for AGI 的主题分享..." https://weibo.com/ttarticle/p/show?id=2309634884017727013209 (2023-03-27) |
职位:超级智能实验室(Meta Superintelligence Labs)核心成员 | 基本属实 多家媒体报道证实Meta确实成立了名为"Meta Superintelligence Labs"的新部门,且Jack Rae是被招募的11位AI顶尖人才之一。但关于他在该实验室中的具体职位,目前公开信息有限。 | 1. "美东时间周一,Meta公司CEO扎克伯格表示,他正在创建Meta超级智能实验室(Meta Superintelligence Labs)。" https://finance.eastmoney.com/a/202507013444586284.html (2025-07-01) 2. "在过去几周里,Meta从OpenAI、Anthropic和Google等竞争对手手中共招募了11位AI顶尖人才,几乎覆盖了当前主流大模型的全部研发脉络:...DeepMind的Jack Rae和Pei Sun,曾负责Gemini模型和多模态推理系统;" https://k.sina.cn/article_6890756658_19ab89a3204001fk8q.html (2025-07-01) 3. "Meta超级智能实验室是2025年6月建立的实验室,由马克·扎克伯格重组建立,机构由Alexandr Wang领导,担任首席人工智能官。...机构人员:Jack Rae、Pei Sun、特拉皮特·班萨尔、Huiwen Chang、Ji Lin、Jiahui Yu、Shuchao Bi、Shengjia ZhaoHongyu Ren、乔尔·波巴(Joel Pobar)等。" https://baike.baidu.com/item/Meta%E8%B6%85%E7%BA%A7%E6%99%BA%E8%83%BD%E5%AE%9E%E9%AA%8C%E5%AE%A4/65831818 (2025-07-21) |
背景:被扎克伯格亲自挖角,成为Meta组建AGI团队的首批关键人物之一 | 基本属实 多家科技媒体报道了此事,并归因于扎克伯格的亲自招募。 | "据外媒彭博社今日报道,Meta公司正从谷歌、Sesame AI等科技公司挖走顶尖工程师,谷歌DeepMind的首席研究员杰克·雷(Jack Rae)已离开谷歌加入Meta。" https://m.zhidx.com/p/485447.html (2025-06-12) |
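补充说明:上表"压缩即智能"一行所述的"语言建模即无损压缩"的对应关系,可以用算术编码的理想码长来直观理解:模型给一段数据的总码长约为 $-\sum_i \log_2 p(x_i)$ bit,平均 log-loss 越低,压缩就越高效。下面是一个概念性的 Python 草图,其中两组 token 概率为虚构数字,仅作示意;严格的无损压缩论证通常还需计入描述模型或解码器本身的成本,此处从略:

```python
import math

def compressed_size_bits(token_log2_probs: list) -> float:
    """在算术编码的理想情形下, 模型给数据的总码长约等于
    -Σ log2 p(token), 即交叉熵损失(以 bit 计)在整段数据上的总和。"""
    return -sum(token_log2_probs)

# 假设两个模型对同一段 1000 个 token 的文本给出的平均 log2 概率不同(虚构数字)
weak_model   = [-6.0] * 1000   # 平均每个 token 约 6 bit
strong_model = [-3.5] * 1000   # 平均每个 token 约 3.5 bit

print(compressed_size_bits(weak_model))    # 6000.0 bit
print(compressed_size_bits(strong_model))  # 3500.0 bit: log-loss 更低, 压缩更高效
```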