预备调查清单
待核查内容1:LeCun 在《深度学习》一书中表示 CNN 不适用于自然图像。
行动计划
Search
深度学习 书籍 Yann LeCun 作者
我需要核查引用"LeCun 在《深度学习》一书中表示 CNN 不适用于自然图像"的准确性。我将系统地进行调查。
Search
Deep Learning book Goodfellow Bengio Courville Yann LeCun
我需要更深入地了解 Yann LeCun 与《深度学习》这本书的关系,以及他对 CNN 与自然图像的观点。
Search
LeCun CNN 自然图像 不适用 Deep Learning book
我需要进一步了解 Yann LeCun 对 CNN 在自然图像处理方面的观点,特别是查找是否有他表示"CNN 不适用于自然图像"的言论。
Search
Yann LeCun CNN natural images limitations 'not suitable'
已核查结论清单
已核查内容1:Yann LeCun 并非通常所称《深度学习》教材("花书",Goodfellow 等人著)的作者,但他是 2015 年《Nature》综述《Deep Learning》的第一作者。
已核查内容2:初步检索未发现 Yann LeCun 有"CNN 不适用于自然图像"的公开表述;作为 CNN 的主要开创者,其公开观点与该说法相反。
待补充核查清单
待核查内容1:原始引述“LeCun 在《深度学习》一书中表示 CNN 不适用于自然图像”的确切来源。
行动计划
Search
LeCun Bengio Hinton Deep Learning Nature paper CNN limitations natural images
我需要查找 LeCun、Bengio 和 Hinton 在《Nature》上发表的深度学习综述文章,特别是关于 CNN 与自然图像的内容。
Search
LeCun Bengio Hinton Nature Deep Learning CNN natural images
我需要查看 LeCun、Bengio 和 Hinton 在 Nature 上发表的深度学习综述文章的原文,以核实引用内容。
Search
LeCun Bengio Hinton Nature Deep Learning PDF
我需要直接访问 LeCun、Bengio 和 Hinton 在 Nature 上发表的深度学习综述文章,以核实引用内容。
Deep learning
Yann LeCun1,2, Yoshua Bengio3 & Geoffrey Hinton4,5
Nature 521, 436–444 (28 May 2015). doi:10.1038/nature14539
1Facebook AI Research, 770 Broadway, New York, New York 10003 USA. 2New York University, 715 Broadway, New York, New York 10003, USA. 3Department of Computer Science and Operations Research, Université de Montréal, Pavillon André-Aisenstadt, PO Box 6128 Centre-Ville STN Montréal, Quebec H3C 3J7, Canada. 4Google, 1600 Amphitheatre Parkway, Mountain View, California 94043, USA. 5Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario M5S 3G4, Canada.
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users' interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.
Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.
Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.
Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition1–4 and speech recognition5–7, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules8, analysing particle accelerator data9,10, reconstructing brain circuits11, and predicting the effects of mutations in non-coding DNA on gene expression and disease12,13. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding14, particularly topic classification, sentiment analysis, question answering15 and language translation16,17.
We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.
Supervised learning
The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as 'knobs' that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.
To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.
The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.
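上文描述的"得分向量、目标函数、沿梯度反方向调整权重"的机制,可以用一段极简的 NumPy 代码示意(假设性示例,非论文原始代码;维度与学习率均为随意取值):

```python
import numpy as np

# 线性"打分机器":4 个类别、8 维输入;权重即文中所说的"旋钮"。
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4, 8))

def loss(W, x, t):
    y = W @ x                          # 输出:每个类别一个得分
    return 0.5 * np.sum((y - t) ** 2)  # 目标函数:得分与期望模式的距离

x = rng.normal(size=8)
t = np.array([1.0, 0.0, 0.0, 0.0])     # 期望:正确类别得分最高

y = W @ x
grad_W = np.outer(y - t, x)            # 梯度:每个权重微增时误差的变化量
before = loss(W, x, t)
W -= 0.05 * grad_W                     # 沿梯度的反方向小步调整权重向量
print(before, "->", loss(W, x, t))     # 更新后误差下降
```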
In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques18. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.
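SGD"用小批样本估计平均梯度并反复更新"的过程,可以用如下极简线性回归示意(假设性示例,mini-batch 大小与学习率均为随意取值):

```python
import numpy as np

# 用 SGD 拟合线性模型 y = X w,目标函数为均方误差。
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.01 * rng.normal(size=1000)     # 带少量噪声的训练集

w = np.zeros(5)
for epoch in range(20):
    idx = rng.permutation(1000)
    for start in range(0, 1000, 32):              # 许多小样本集(mini-batch)
        batch = idx[start:start + 32]
        err = X[batch] @ w - y[batch]
        grad = X[batch].T @ err / len(batch)      # 这批样本上的平均梯度(有噪声的估计)
        w -= 0.05 * grad                          # 按梯度反方向调整权重
print(np.round(w, 2))                             # 应接近 true_w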
Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.
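对照上文,两类线性分类器"加权和超过阈值即判类"的规则可以写成如下示意(权重与阈值均为虚构数值):

```python
import numpy as np

# 两类线性分类器:对特征向量各分量加权求和,与阈值比较。
w = np.array([0.8, -0.5, 1.2])       # 每个特征分量的权重
threshold = 0.0

def classify(features):
    weighted_sum = np.dot(w, features)
    return "category A" if weighted_sum > threshold else "category B"

print(classify(np.array([1.0, 0.2, 0.5])))   # 加权和 > 0,判为 category A
print(classify(np.array([0.1, 2.0, 0.0])))   # 加权和 < 0,判为 category B
```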
Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane19. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other.
Figure 1 | Multilayer neural networks and backpropagation. a, A multilayer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid (shown on the left) in input space is also transformed (shown in the middle panel) by hidden units. This is an illustrative example with only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units. Reproduced with permission from C. Olah (http://colah.github.io/). b, The chain rule of derivatives tells us how two small effects (that of a small change of x on y, and that of y on z) are composed. A small change Δx in x gets transformed first into a small change Δy in y by getting multiplied by ∂y/∂x (that is, the definition of partial derivative). Similarly, the change Δy creates a change Δz in z. Substituting one equation into the other gives the chain rule of derivatives — how Δx gets turned into Δz through multiplication by the product of ∂y/∂x and ∂z/∂y. It also works when x, y and z are vectors (and the derivatives are Jacobian matrices). c, The equations used for computing the forward pass in a neural net with two hidden layers and one output layer, each constituting a module through which one can backpropagate gradients. At each layer, we first compute the total input z to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function f(.) is applied to z to get the output of the unit. For simplicity, we have omitted bias terms. The non-linear functions used in neural networks include the rectified linear unit (ReLU) f(z) = max(0,z), commonly used in recent years, as well as the more conventional sigmoids, such as the hyperbolic tangent, f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)), and the logistic function, f(z) = 1/(1 + exp(−z)). d, The equations used for computing the backward pass. At each hidden layer we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs to the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the input by multiplying it by the gradient of f(z). At the output layer, the error derivative with respect to the output of a unit is computed by differentiating the cost function. This gives yl − tl if the cost function for unit l is 0.5(yl − tl)², where tl is the target value. Once the ∂E/∂zk is known, the error-derivative for the weight wjk on the connection from unit j in the layer below is just yj ∂E/∂zk.
[图1(原文此处为插图,以下仅保留图中可恢复的公式与标注)
a:输入(2 个单元)→ 隐藏层(2 个 sigmoid 单元)→ 输出(1 个 sigmoid 单元)的空间扭曲示意。
b:链式法则:Δy = (∂y/∂x)Δx,Δz = (∂z/∂y)Δy,代入得 Δz = (∂z/∂y)(∂y/∂x)Δx。
c(前向传播,隐藏层 H1、H2 与输出层,权重为 wij、wjk、wkl):zj = Σ_{i∈Input} wij xi,yj = f(zj);zk = Σ_{j∈H1} wjk yj,yk = f(zk);zl = Σ_{k∈H2} wkl yk,yl = f(zl)。
d(反向传播,将输出与正确答案比较得到误差导数):∂E/∂yl = yl − tl;∂E/∂zl = (∂E/∂yl)(∂yl/∂zl);∂E/∂yk = Σ_{l∈out} wkl (∂E/∂zl);∂E/∂zk = (∂E/∂yk)(∂yk/∂zk);∂E/∂yj = Σ_{k∈H2} wjk (∂E/∂zk);∂E/∂zj = (∂E/∂yj)(∂yj/∂zj)。]
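按照图1面板 c、d 的公式,可以写出两隐藏层网络前向与反向传播的 NumPy 示意(假设性示例,非论文代码;此处取 f = tanh,代价函数为 0.5·Σ(yl − tl)²):

```python
import numpy as np

rng = np.random.default_rng(2)
f = np.tanh                                   # 非线性函数 f(.)
df = lambda z: 1.0 - np.tanh(z) ** 2          # f 的导数,用于 ∂y/∂z

# 权重矩阵 wij, wjk, wkl(与图注一致,省略偏置项;维度为随意取值)
w_ij = rng.normal(scale=0.5, size=(8, 6))     # 输入(8) → H1(6)
w_jk = rng.normal(scale=0.5, size=(6, 5))     # H1(6) → H2(5)
w_kl = rng.normal(scale=0.5, size=(5, 3))     # H2(5) → 输出(3)

x = rng.normal(size=8)
t = np.array([1.0, 0.0, 0.0])                 # 目标值 tl

# 前向传播(面板c):z 为下层输出的加权和,再经 f 得到本层输出 y
z_j = w_ij.T @ x;  y_j = f(z_j)
z_k = w_jk.T @ y_j; y_k = f(z_k)
z_l = w_kl.T @ y_k; y_l = f(z_l)

# 反向传播(面板d):从输出层开始逐层回传误差导数
dE_dy_l = y_l - t                             # ∂E/∂yl = yl − tl
dE_dz_l = dE_dy_l * df(z_l)                   # ∂E/∂zl
dE_dy_k = w_kl @ dE_dz_l                      # ∂E/∂yk = Σ wkl ∂E/∂zl
dE_dz_k = dE_dy_k * df(z_k)
dE_dy_j = w_jk @ dE_dz_k
dE_dz_j = dE_dy_j * df(z_j)

# 权重梯度:∂E/∂wjk = yj · ∂E/∂zk(各层同理)
grad_w_kl = np.outer(y_k, dE_dz_l)
grad_w_jk = np.outer(y_j, dE_dz_k)
grad_w_ij = np.outer(x, dE_dz_j)
print(grad_w_ij.shape, grad_w_jk.shape, grad_w_kl.shape)
```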
A linear classifier, or any other 'shallow' classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal.
To make classifiers more powerful, one can use generic non-linear features, as with kernel methods20, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples21. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.
A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.
Backpropagation to train multilayer architectures
From the earliest days of pattern recognition22,23, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s24–27.
The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.
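这段"沿模块栈反向逐层应用链式法则"的思想,可以在一个标量的模块组合上做数值验证(假设性示例):

```python
import numpy as np

# 组合 z = g(y), y = h(x):检验 dz/dx 是否等于逐模块局部导数的乘积。
h = lambda x: np.tanh(x)          # 第一个模块
g = lambda y: y ** 2 + 3 * y      # 第二个模块

x = 0.7
y = h(x)
dy_dx = 1 - np.tanh(x) ** 2       # 模块 h 的局部导数
dz_dy = 2 * y + 3                 # 模块 g 的局部导数
analytic = dz_dy * dy_dx          # 链式法则:从输出往回逐模块相乘

eps = 1e-6                        # 用中心差分作数值对照
numeric = (g(h(x + eps)) - g(h(x - eps))) / (2 * eps)
print(np.isclose(analytic, numeric))   # True
```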
Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training28. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).
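文中提到的几种非线性函数可以直接对照计算(示意):

```python
import numpy as np

# ReLU 即半波整流 max(z, 0);tanh 与 logistic 是过去常用的平滑非线性。
z = np.linspace(-3, 3, 7)
relu = np.maximum(z, 0.0)
tanh = np.tanh(z)
logistic = 1.0 / (1.0 + np.exp(-z))
for zi, r, t, s in zip(z, relu, tanh, logistic):
    print(f"z={zi:+.1f}  ReLU={r:.2f}  tanh={t:+.2f}  logistic={s:.2f}")
```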
In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.
In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder29,30.
Figure 2 | Inside a convolutional network. The outputs (not the filters) of each layer (horizontally) of a typical convolutional network architecture applied to the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs, bottom right). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in output. ReLU, rectified linear unit.
[图中处理流程(自下而上):Red/Green/Blue 输入 → Convolutions and ReLU → Max pooling → Convolutions and ReLU → Max pooling → Convolutions and ReLU → 输出类别得分:Samoyed (16); Papillon (5.7); Pomeranian (2.7); Arctic fox (1.0); Eskimo dog (0.6); white wolf (0.4); Siberian husky (0.4)]
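图2所示"卷积 + ReLU + 最大池化"的单个阶段,可以用 NumPy 写成如下极简示意(假设性示例,非论文实现;卷积核为随机初始化,实际中由学习得到):

```python
import numpy as np

rng = np.random.default_rng(3)
image = rng.normal(size=(8, 8))          # 单通道"图像"
kernel = rng.normal(size=(3, 3))         # 一个卷积核,扮演局部特征检测器(如边缘)

def conv2d_valid(x, k):
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)   # 同一组权重扫过整幅图像
    return out

feature_map = np.maximum(conv2d_valid(image, kernel), 0.0)    # Convolutions and ReLU
pooled = feature_map.reshape(3, 2, 3, 2).max(axis=(1, 3))     # 2×2 Max pooling:对小位移不变
print(feature_map.shape, pooled.shape)                        # (6, 6) (3, 3)
```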
The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.
Interest in deep feedforward networks was revived around 2006 (refs 31–34) by a group of ...
已核查结论清单
已核查内容1:原始陈述"LeCun 在《深度学习》一书中表示 CNN 不适用于自然图像"缺乏任何可靠出处支持,且与现有证据相悖(最终结论见下文)。
待补充核查清单
待核查内容1:在 LeCun、Bengio 和 Hinton 合著的《Nature》论文《Deep Learning》全文中,是否存在任何关于 CNN 在自然图像应用上局限性的特定上下文陈述,可能被断章取义或误解为“不适用”。
行动计划
本文介绍了深度学习技术,特别是卷积神经网络(CNN)在图像处理领域的应用和原理。文章首先阐述了传统机器学习技术在处理原始数据方面的局限性,指出它们需要精心设计的特征提取器。相比之下,深度学习方法能够自动从原始数据中发现所需的表示,通过多层非线性模块组合,逐层提取更抽象的特征。在图像处理中,CNN的各层分别识别边缘、图案组合、物体部件直至完整物体。文章强调深度学习的关键在于这些特征层不是由工程师设计,而是通过通用学习程序从数据中学习得到。文章还介绍了监督学习的基本原理,包括目标函数、梯度下降和反向传播算法。作者指出,深度学习已在图像识别、语音识别、自然语言处理等多个领域取得突破性进展,并预测随着计算能力和数据量的增加,深度学习将在未来取得更多成功。
文章中提到了关于CNN处理自然图像的几个限制和挑战:
选择性-不变性困境:文章指出在图像识别中存在一个核心问题 - 系统需要对图像中重要的特征保持选择性,同时对不相关的变化(如位置、方向或照明条件)保持不变性。例如,在像素级别上,两只不同姿势和环境中的萨摩耶犬可能看起来非常不同,而同一位置和相似背景下的萨摩耶犬和狼的图像可能非常相似。这种困境对传统的线性分类器或浅层分类器构成了挑战。
原始像素处理的局限性:文章明确指出,"在像素级别上操作的线性分类器或任何其他'浅层'分类器不可能区分后者两个(萨摩耶犬和狼),同时将前者两个(两只不同姿势的萨摩耶犬)归为同一类别"。这表明直接处理原始像素数据存在固有限制。
需要多层非线性处理:文章强调,为了解决上述问题,系统需要"多个非线性层,比如5到20层深度",才能实现"对输入的极其复杂的函数,这些函数同时对微小细节敏感——区分萨摩耶犬和白狼——并对大量不相关变化如背景、姿势、照明和周围物体不敏感"。这暗示了浅层网络在处理复杂自然图像时的局限性。
传统观点中的局限性:文章提到"在1990年代末,神经网络和反向传播在很大程度上被机器学习社区所放弃,并被计算机视觉和语音识别社区所忽视。人们普遍认为,在几乎没有先验知识的情况下学习有用的多阶段特征提取器是不可行的"。这反映了当时对CNN处理复杂图像能力的怀疑。
虽然文章主要强调了CNN在图像处理方面的成功,但这些提到的挑战和早期的限制性观点为理解CNN在处理自然图像时可能面临的困难提供了线索。
文章中包含两张与CNN处理自然图像相关的图片:
图片1: 多层神经网络和反向传播
图片2: 卷积网络内部
这两张图片直接展示了CNN如何处理自然图像,特别是图2展示了CNN如何从低级特征(如边缘检测)逐步提取更高级的特征,最终实现图像分类。
我需要进一步查看LeCun的观点,特别是他关于CNN和自然图像的最新观点。
Search
Yann LeCun CNN vs ViT natural images recent views 2023 2024
我需要查看 LeCun 在《Nature》发表的深度学习综述文章中关于 CNN 和自然图像的具体内容。
本文是由Yann LeCun、Yoshua Bengio和Geoffrey Hinton撰写的关于深度学习的综述文章。文章介绍了深度学习作为一种能够从原始数据中自动发现表示的方法,通过多层非线性处理模块组合来学习数据的多层次抽象表示。文章指出,传统机器学习技术在处理原始形式的自然数据方面能力有限,而深度学习通过多层表示学习克服了这一限制。文章详细解释了深度学习的工作原理,包括如何通过多层特征提取逐渐将原始输入(如图像像素)转化为更抽象的表示,以及如何使用反向传播算法训练多层网络。文章强调深度学习在图像识别、语音识别、自然语言处理等多个领域取得了突破性进展,并介绍了监督学习的基本概念、随机梯度下降等训练方法。文章还讨论了卷积神经网络(CNN)的结构和工作原理,展示了它如何通过层次化特征提取处理图像数据。
在查阅整篇文章后,我没有发现LeCun明确表示CNN不适用于自然图像的内容。相反,文章中有多处内容表明CNN在处理自然图像方面非常有效:
文章指出深度学习在图像识别领域取得了显著成果:"Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years... In addition to beating records in image recognition..."
文章详细介绍了CNN如何处理图像数据,特别是图2展示了CNN如何处理一张萨摩耶犬的自然图像,通过多层特征提取最终实现分类。
文章解释了深度学习如何解决自然图像处理中的"选择性-不变性困境"(selectivity–invariance dilemma),即如何在保持对重要特征敏感的同时,对无关变化(如姿势、光照等)保持不变性。
文章描述了CNN如何通过多层非线性变换处理图像:"With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects."
文章还提到CNN在图像、视频、语音和音频处理方面带来了突破性进展:"Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio..."
综上所述,LeCun在文章中不仅没有表示CNN不适用于自然图像,反而强调了CNN在处理自然图像方面的优越性和有效性。文章通过具体例子(如萨摩耶犬图像的处理)展示了CNN如何成功处理自然图像。
我已经收集了足够的信息来核查引用的准确性。现在我将提供最终答案。
本次核查共涉及1项内容,经查证该内容严重不实。
核查发现,引述"LeCun在《深度学习》一书中表示CNN不适用于自然图像"存在两个根本性错误:首先,Yann LeCun并非《深度学习》(通常指Goodfellow等人所著的"花书")的作者;其次,LeCun作为CNN领域的开创者,其学术观点与该引述完全相反,他一直强调并证明CNN在处理自然图像方面的有效性和优越性。
建议:不应在学术或其他正式场合引用此陈述,因为它严重误导读者对LeCun观点的理解。如需引用LeCun关于CNN的观点,应直接查阅其在《Nature》发表的论文或其他原始著作。
原文陈述 | 可靠性分析 | 证据 |
---|---|---|
LeCun在《深度学习》一书中表示CNN不适用于自然图像 | 严重不实。该陈述包含两个核心错误:<br>1. 作者归属错误:Yann LeCun 并非被广泛称为《深度学习》的权威教材("花书")的作者;<br>2. 观点归属错误:LeCun 作为"CNN之父",其学术观点与该引述完全相反,他一直强调并证明 CNN 在处理自然图像方面的有效性和优越性。<br>LeCun 近期的相关言论仅指出纯 ViT 架构在处理高分辨率实时视频等特定任务时效率低下,并提倡使用卷积与自注意力结合的混合架构,这恰恰肯定了 CNN 在底层图像处理中的基础性作用。 | 1. "作者: [美] 伊恩·古德费洛 / [加] 约书亚·本吉奥 / [加] 亚伦·库维尔" https://book.douban.com/subject/27087503/ (2017-07-01)<br>2. "Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction... Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech." https://www.cs.toronto.edu/~hinton/absps/NatureDeepReview.pdf (2015-05-28)<br>3. "他(LeCun)想表达的是,ViT 太慢、效率太低,导致不适合实时处理高分辨率图像和视频任务……在坚持自注意力循环之前,你至少需要几个具有池化和步幅的卷积层。" https://zhuanlan.zhihu.com/p/701233453 (2024-06-03)<br>4. "杨立昆的主要贡献是大幅完善了卷积神经网络(CNN),并将其用于计算机视觉领域,极大地拓宽了应用范围。" https://zhuanlan.zhihu.com/p/397978083 (2021-08-09)