Okay, the user wants me to translate a Chinese innovation work statement into English for MIT's 35 Under 35 application. This is clearly a high-stakes professional context where precision and technical accuracy matter.
First, I notice this isn't just any translation request - the content describes major AI research contributions. The user mentions ResNet (which I recognize as foundational in deep learning), ShuffleNet, and RepVGG architectures. They're clearly a top-tier AI researcher transitioning from academia to founding an AGI startup.
The text has several challenging aspects to translate:
I'll need to maintain:
Special attention needed for:
The most complex part is the AGI mission statement paragraph. I'll need to carefully unpack phrases like "生成和理解两大维度" into "generation and comprehension dimensions" while preserving their strategic vision.
For names and brands:
The tone should balance:
I'll structure the translation to mirror their career arc: academic roots → industry transition → startup phase → future outlook. The English version must sound like it was originally written by a native English-speaking AI researcher.
Here's the polished English translation of your innovation statement for the MIT 35 Under 35 application, maintaining technical accuracy and a compelling narrative flow:
Personal Innovation Statement
Deep neural networks form the cornerstone of modern artificial intelligence. Over the past decade, I have dedicated myself to researching the design, training, and optimization methodologies for general-purpose neural networks. My work consistently aims to enhance model practicality and intelligence, striving to make AI beneficial to all.
The Scaling Law dictates a positive correlation between model intelligence and the volume of data, model size, and computational scale. However, simply deepening or widening models encounters training instability or performance degradation. During my PhD at MSRA in 2015, I co-developed ResNet, introducing residual connections to overcome these limitations. This breakthrough enabled models to scale from 20 layers to over 100, even thousands of layers deep, solving depth-wise scaling for the first time. The same year, our complementary work on MSRA initialization identified widening instability as stemming from vanishing/exploding gradients and proposed an initialization scheme to mitigate this. The core concepts of these two works are now ubiquitous in nearly all neural networks.
Transitioning to the industrial sector in 2017, I shifted focus toward practical model deployment. A key challenge is achieving the performance of larger models with fewer parameters and computations, while ensuring efficient execution on inference hardware. Our representative work, ShuffleNet (v1/v2), introduced lightweight operations like Channel Shuffle, maximizing feature reuse and significantly reducing performance loss during model shrinkage. At Megvii, millions of smartphones incorporated ShuffleNet-based FaceID SDKs in 2018 alone, enabling millisecond-level facial unlocking and authentication for mobile users.
Another major practical hurdle lies in the tension between sophisticated model architectures and efficient hardware inference. In 2021, we proposed RepVGG, leveraging re-parameterization techniques to decouple training and inference phases. This allowed complex structures for high training accuracy but reverted to a simple, VGG-like architecture during inference for streamlined hardware deployment. In 2022, applying the same re-parameterization principle, we deeply analyzed existing Vision Transformers (ViTs) and proposed RepLKNet—a framework utilizing ultra-large convolutional kernels. RepLKNet outperformed mainstream ViTs while maintaining structural simplicity and ease of deployment.
In 2023, I co-founded StepFun, an AGI-focused company. Our mission is to explore Artificial General Intelligence. Our strategic roadmap begins with multi-modal fusion large models, scaling up along generation and comprehension dimensions to unify multi-modal understanding and creation. Combining environmental interaction and system-2 mechanisms will lead to a world model, ultimately realizing AGI. As the starting point, we introduced DreamLLM—one of the earliest frameworks for integrated multi-modal generation and understanding. Based on DreamLLM, we launched Step1V, China's first billion-parameter native multi-modal large model by the end of 2023. Launched nearly simultaneously with Google's Gemini 1.0, Step1V demonstrated significantly superior multi-modal understanding compared to contemporary vision-language decoupled architectures. Subsequently, we released Step2 (a trillion-parameter Mixture-of-Experts foundation model), StepVideo (a video generation model), Step1o (a multi-modal understanding model for image, text, and audio), and StepR-mini (a reasoning model).
Looking ahead, I remain committed to researching novel architectures and algorithms, continuously advancing the intelligence of AI systems, particularly in complex multi-modal reasoning, spatial intelligence, and online autonomous learning. I believe that as models approach AGI, artificial intelligence will undoubtedly bring profound benefits to humanity.
Key Translation Notes:
This translation effectively conveys your significant contributions to neural network architecture, model efficiency, and practical AI deployment, positioning you as a strong candidate for the MIT 35 Under 35 award. Good luck!