Dear friends,
I’ve seen many friends transition from an academic or research role to a corporate role. The most successful ones adjusted to corporate work by shifting their mindset in a few crucial ways.
The worlds of academia and industry are governed by different values. The former prizes scientific innovation and intellectual freedom, and the latter prizes building a successful business that delivers impact and profit. If you’re thinking of taking the leap, here are some tips that might ease the way.
- Speed versus accuracy: In academia, publishing technically accurate work is paramount. For example, if you publish a paper saying algorithm A is superior to algorithm B, you’d better be right! In industry, often there’s no right answer. Should you build a system using algorithm A or B? Or should you tackle project X or Y? Rather than striving for the right answer, it’s frequently better to make a quick decision (especially if you have an opportunity to reverse it later).
- Return on investment (ROI) versus novelty: Academia places a high premium on fresh ideas. Many ideas are publishable at least partly because they’re different from their predecessors. (That said, smart researchers don’t just aim to publish; they aim to make a broader impact.) The corporate world evaluates innovations through the lens of ROI and their contribution to the business.
- Experienced versus junior teams: Universities are used to seeing individuals go from not knowing how to code to publishing groundbreaking research. As a result, corporate managers with an academic background often hire junior teams even when the task at hand calls for established expertise. As you know, I’m a strong believer in learning. While a degree program commonly takes years to complete, many business projects can’t wait for team members to grow into a role. By all means, invest heavily in educating your teams — and also consider when you need to hire experienced people to meet your deadlines.
- Interdisciplinary work versus disciplinary specialization: In academia, you can talk exclusively with other machine learning researchers all day long and, through the discussion, push forward the state of the art. In most companies, outside of research labs, a project may require input from teams focused on machine learning, software engineering, product development, sales/marketing, and other areas. To execute it, you need to understand areas outside your speciality and work productively with the teams responsible for them.
- Top-down versus bottom-up management: In an academic setting, decisions about where to devote attention are frequently made at the individual or research group level. In the corporate world, there’s a greater tendency toward top-down management to make sure that teams are aligned and execute successfully.
The shift in mindset between academia and industry is significant, but knowing the key differences in advance can make it easier to shift appropriately. I’ve enjoyed roles in both domains, and both offer valuable ways to move the world forward.
Keep learning!
Andrew
P.S. We hear a lot about AI folks going from academia to industry, but transitions in the opposite direction happen, too. For example, Peter Norvig, after 20 years at Google where he played a key role in building Google Research, recently joined Stanford University.
News
Crawl the Web, Absorb the Bias
The emerging generation of trillion-parameter models needs datasets of billions of examples, but the most readily available source of examples on that scale — the web — is polluted with bias and antisocial expressions. A new study examines the issue.
What’s new: Abeba Birhane and colleagues at University College Dublin and University of Edinburgh audited the LAION-400M dataset, which was released in September. It comprises data scraped from the open web, from which inaccurate entries were removed by a state-of-the-art model for matching images to text. The automated curation left plenty of worrisome material among the remaining 400 million examples, including stereotypes, racial slurs, and sexual violence, raising concerns that models trained on LAION-400M would inherit its shortcomings.
Key insight: The compilers of LAION-400M paired images and text drawn from Common Crawl, a large repository of web data. To filter out low-quality pairs, they used CLIP to score the correspondence between them and discarded those with the lowest scores. But CLIP itself is trained on a massive trove of web data. Thus it’s bound to find a high correspondence between words and pictures that are frequently associated with one another on the web, even if the associations are spurious or otherwise undesirable.
NSFT (not safe for training): The authors entered text queries into LAION-400M’s search function, which returned matching images.
- In response to queries about women, for instance “latina,” “aunty,” and “nun,” the search engine returned a high percentage of pornography and depictions of sexual violence. Similarly, some non-gendered queries including “Korean” and “Indian” returned sexually explicit images of women.
- Other queries returned biased results. For example, “CEO” returned images of men but not women. “Terrorist” returned images of Middle Eastern men but not people wearing Ku Klux Klan outfits.
- Examining CLIP, the authors found that the 0.3 cosine similarity threshold didn’t weed out image/text pairs that expressed stereotypes, sexism, or racism. For instance, CLIP gave a passing score to a female astronaut’s portrait accompanied by the words, “this is a photograph of a smiling housewife in an orange jumpsuit with the American flag.”
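The filtering step at issue can be sketched in a few lines. This is a toy illustration, not the LAION-400M pipeline: the three-dimensional vectors stand in for CLIP embeddings, the function names are hypothetical, and treating a score of exactly 0.3 as passing is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_pairs(pairs, threshold=0.3):
    """Keep captions whose image/text similarity meets the threshold.

    Each item is (image_embedding, text_embedding, caption). The 0.3
    default is the threshold cited in the study; >= is an assumption.
    """
    kept = []
    for image_vec, text_vec, caption in pairs:
        if cosine_similarity(image_vec, text_vec) >= threshold:
            kept.append(caption)
    return kept

# Toy embeddings (hypothetical values, not real CLIP outputs).
pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], "well matched"),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], "unrelated"),
    ([1.0, 1.0, 0.0], [1.0, 0.0, 1.0], "partly matched"),
]
kept = filter_pairs(pairs)
print(kept)
```

The filter measures only how strongly the scoring model associates an image with its caption. It has no notion of whether that association is accurate or acceptable, which is why a pairing the model has learned from biased web data can still pass.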
Behind the news: The LAION-400M team, a loosely knit collective led by Christoph Schuhmann at University of Vienna, aims to re-create Google’s Wikipedia-based Image Text dataset and ultimately use it to train open-source analogs of OpenAI’s CLIP and DALL·E. The group was inspired by EleutherAI’s community effort to build an open source version of GPT-3.
Why it matters: It’s enormously expensive to manually clean a dataset that spans hundreds of millions of examples. Automated curation has been viewed as a way to ensure that immense datasets contain high-quality data. This study reveals serious flaws in that approach.
We’re thinking: Researchers have retracted or amended several widely used datasets to address issues of biased and harmful data. Yet, as the demand for data rises, there’s no ready solution to this problem. Audits like this make an important contribution, and the community — including large corporations that produce proprietary systems — would do well to take them seriously.
Transformer Speed-Up Sped Up
The transformer architecture is notoriously inefficient when processing long sequences — a problem in processing images, which are essentially long sequences of pixels. One way around this is to break up input images and process the pieces separately. New work improves upon this already-streamlined approach.
What’s new: Zizhao Zhang and colleagues at Google and Rutgers University simplified an earlier proposal for using transformers to process images. They call their architecture NesT.
Key insight: A transformer that processes parts of an image and then joins them can work more efficiently than one that looks at the entire image at once. However, to relate the parts to the whole, it must learn how the pixels in different regions relate to one another. A recent model called Swin does this by shifting region boundaries in between processing regions and merging them together, a step that consumes compute cycles. Using convolutions to process both within and across regions can enable a model to learn such relationships without shifting region boundaries, saving that computation.
How it works: The authors trained NesT to classify images in ImageNet.
- The authors divided input images into regions and partitioned each region into a grid. A transformer generated a representation of each grid square.
- The model downsampled every block of four adjacent squares using a convolutional layer, combining the representations of each square into a representation of the block.
- Then the model combined adjacent blocks and regenerated the representation until only one representation, representing the entire image, remained.
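The bottom-up merging in these steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: averaging stands in for the learned convolution, the transformer layers that would re-process the representations at each level are omitted, and the function names and toy grid are hypothetical.

```python
def merge_2x2(grid):
    """Combine every 2x2 block of vectors into a single vector.

    Averaging is a stand-in for the learned convolutional layer.
    """
    size = len(grid)
    dim = len(grid[0][0])
    merged = []
    for r in range(0, size, 2):
        row = []
        for c in range(0, size, 2):
            block = [grid[r][c], grid[r][c + 1],
                     grid[r + 1][c], grid[r + 1][c + 1]]
            row.append([sum(v[i] for v in block) / 4.0 for i in range(dim)])
        merged.append(row)
    return merged

def aggregate(grid):
    """Merge repeatedly until one representation covers the whole image."""
    levels = 0
    while len(grid) > 1:
        grid = merge_2x2(grid)
        levels += 1
    return grid[0][0], levels

# A 4x4 grid of 2-dimensional representations (toy values).
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
image_vec, levels = aggregate(grid)
print(image_vec, levels)
```

Each pass quarters the number of representations, so a grid with n squares per side (n a power of two) needs log2(n) merge levels to reach a single image-level vector.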
Results: A 38 million-parameter NesT achieved 83.3 percent accuracy on ImageNet. This performance matched that of an 88 million-parameter Swin-B while using 57 percent fewer parameters.
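The 57 percent figure matches the relative difference between the two parameter counts quoted above. A quick check, assuming it is meant as the reduction relative to Swin-B (the variable names are illustrative):

```python
nest_params = 38_000_000    # NesT, as reported above
swin_b_params = 88_000_000  # Swin-B, as reported above

# Fraction of Swin-B's parameters that NesT does without.
reduction = (swin_b_params - nest_params) / swin_b_params
print(f"{reduction:.1%}")
```

Parameter count is only a rough proxy for compute; the actual cost also depends on input resolution and the operations each layer performs.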
Why it matters: Transformers typically bog down when processing images. NesT could help vision applications take fuller advantage of the architecture’s strengths.
We’re thinking: Computational efficiency for the Swin!
A MESSAGE FROM DEEPLEARNING.AI
We’re updating our Natural Language Processing Specialization to reflect the latest advances! Join instructor Younes Bensouda Mourri and Hugging Face engineer Lewis Tunstall for a live Ask Me Anything session on November 3, 2021. Get answers to all your NLP-related questions!
Search Goes Multimodal
Google will upgrade its search engine with a new model that tracks the relationships between words, images, and, in time, videos — the first fruit of its latest research into multimodal machine learning and multilingual language modeling.
What’s new: Early next year, Google will integrate a new architecture called Multitask Unified Model (MUM) into its traditional Search algorithm and Lens photo-finding system, VentureBeat reported. The new model will enable the search engines to break down complex queries (“I’ve hiked Mt. Adams and now I want to hike Mt. Fuji next fall. What should I do differently to prepare?”) into simpler requests (“prepare to hike Mt. Adams,” “prepare to hike Mt. Fuji,” “Mt. Fuji next fall”). Then it can combine the results of the simpler requests into a coherent answer.
How it works: Announced in May, MUM is a transformer-based natural language model. It’s based on Google’s earlier T5 and comprises around 110 billion parameters (compared to BERT’s 110 million, GPT-3’s 175 billion, and Google’s own Switch Transformer at 1.6 trillion). It was trained on a dataset of text and image documents drawn from the web from which hateful, abusive, sexually explicit, and misleading images and text were removed.
- Google Search users will see three new features powered by MUM: an AI-curated list that turns broad queries into actionable items and step-by-step instructions, suggestions to tweak queries, and links to relevant audio and video results.
- Google Lens users can take a photo of a pair of boots and, say, ask if they are appropriate to hike a particular mountain. MUM will provide an answer depending on the type of boot and the conditions on the mountain.
- The technology can answer queries in 75 languages and translate information from documents in a different language into the language of the query.
- Beyond filtering objectionable material from the training set, the company tried to mitigate the model’s potential for harm by enlisting humans to evaluate its results for evidence of bias.
Behind the news: In 2019, Google Search integrated BERT. The change improved the results of 10 percent of English-language queries, the company said, particularly those that included conversational language or prepositions like “to” (the earlier version couldn’t distinguish the destination country in a phrase like “brazil traveler to usa”). BERT helped spur a trend toward larger, more capable transformer-based language models.
Why it matters: Web search is ubiquitous, but there’s still plenty of room for improvement. This work takes advantage of the rapidly expanding capabilities of transformer-based models.
We’re thinking: While we celebrate any advances in search, we found Google’s announcement short on technical detail. Apparently MUM really is the word.
Roll Over, Beethoven
Ludwig van Beethoven died before he completed what would have been his tenth and final symphony. A team of computer scientists and music scholars approximated the music that might have been.
What’s new: The Beethoven Orchestra in Bonn performed a mock-up of Beethoven’s Tenth Symphony partly composed by an AI system, the culmination of an 18-month project. You can view and hear the performance here.
How it works: The master left behind around 200 fragmentary sketches of the Tenth Symphony, presumably in four movements. A human composer in 1988 completed two movements, for which more source material was available, so the team set out to compose two more.
- Matthias Röder, director of the Karajan Institute, which promotes uses of technology in music, led musical experts in deciding how the sparse contents of the remaining sketches might fit into a symphonic format. Meanwhile, Rutgers University professor Ahmed Elgammal built an AI system to expand the sketches into a fully orchestrated score.
- Elgammal adapted natural language models to music, he told The Batch. The system included components that generated variations on melodic themes, harmonized the results, created transitions, and assigned musical lines to instruments in the orchestra.
- He trained the models first on annotated scores of music that influenced Beethoven, then on the composer’s own body of work. To train the melodic model, for instance, he annotated passages of theme and development. Then he fed the model thematic material from the sketches to generate elaborations on it.
- The system eventually generated over 40 minutes of music in two movements.
Everyone’s a critic: Composer Jan Swafford, who wrote a 2014 biography of Beethoven, described the finished work as uninspired and lacking Beethovenian traits such as rhythms that build to a sweeping climax.
Behind the news: In 2019, Huawei used AI powered by its smartphone processors to realize the final two movements of Franz Schubert’s unfinished Eighth Symphony. The engineers trained their model on roughly 90 pieces of Schubert’s work as well as pieces written by composers who influenced him. A human composer cleaned up the output, organized it into sections, and distributed the notes among various instruments.
Why it matters: AI is finding its way into the arts in a variety of roles. As a composer, the technology generally produces short passages that humans assemble and embellish. It’s not clear how much the team massaged the model’s output in this case, but the ambition clearly is to build an end-to-end symphonic composer.
We’re thinking: Elgammal has published work on generative adversarial networks. Could one of his GANs yield Beethoven’s Eleventh?