UX Roundup: OpenAI Canvas | Evaluating AI Video | AI Podcast | AI Translation | AI Adoption & Retention | Flux 1.1
Summary: OpenAI Canvas heralds hybrid UI for AI | Evaluating AI video models in pairwise quality comparisons | AI-created podcast | AI translation | Fast technology adoption for AI, followed by strong retention | New version 1.1 of Flux image generator
UX Roundup for October 7, 2024. (Ideogram) I made this image into a short video, of flowing lava writing out the text, using the end-frame feature in an image-to-video generation on Luma Dream Machine. Thank you to @LudovicCreator for the prompt.
Jakob Nielsen Live on Stage in San Francisco This Thursday
I will have a rare in-person event this week in San Francisco as part of SFTechWeek: a conversation with Felix Lee, CEO of ADPList. The event is hosted by Dovetail (thank you!) at its main San Francisco location.
Event time: Thursday, October 10, 6:00 PM.
The event is free, but advance registration is required due to limited seating.
Jakob Nielsen in an in-person fireside chat with Felix Lee this Thursday. (Ideogram: sadly, it doesn’t do character consistency, so these two figures don’t quite look like us. Also, it doesn’t seem to know the correct Dovetail logo.)
OpenAI Canvas
OpenAI introduced a new feature called “ChatGPT 4o with canvas” in keeping with its tradition of atrocious product and feature names. My conspiracy theory that the Abominable Snowman leads the naming team in OpenAI’s marketing department keeps seeming more and more real with each new terrible name.
The Abominable Snowman proudly wearing his new product swag while working on his next bad product name for OpenAI. (Midjourney)
Canvas is actually an improved editing feature, not a drawing feature. In fact, we could change the name from ChatGPT to EditGPT. No more linear chatbot: now we can interact with the document and issue in-context instructions:
In Canvas, users can directly select parts of text and issue instructions to the AI for how they want that specific area changed. They can also use a popup menu (see next screenshot below) for shortcut commands for common edits, such as making the text longer or shorter.
Canvas offers popup menus with sliders for adjusting the length or reading level of the text: either across the entire document or for the current selection. Here, I changed a paragraph from a 10th grade reading level (mid high school) in the top screenshot to a 17th grade reading level (5 years past high school, equivalent to a master’s degree) in the bottom screenshot.
Canvas finally abandons the infinite scroll of a chatbot as the interaction paradigm for working with AI. Now, we can interact directly with the document in a 2D interaction style instead of issuing linear commands describing our desired changes and where they should happen. Point to that paragraph or sentence instead of describing it.
Pointing and clicking are the fundamentals of the GUI (graphical user interface), which has remained the dominant interaction style for 40 years for a reason. We now have sliders; we now have select-and-edit in place. Finally, we have the hybrid user interface for AI that I have called for since last year.
The new features in Canvas also offer direct support for accordion editing, which is a user behavior that has been documented for over a year. While OpenAI ought to have conducted their own user research, I say “better late than never” in implementing the findings from third-party research.
Playing the accordion. This instrument serves as a metaphor for users’ common requests to stretch or condense text. (Midjourney)
Overall, “ChatGPT 4o with canvas” is a worthy UX improvement for AI. It’s only available to paid users of ChatGPT, but that’s fair enough. I am rather sad about data showing that only 10 million of the 350 million ChatGPT users pay for the “Plus” subscription. If people want to learn how to use AI, they need to use the best version, not some downgraded freebie version. Having 97% of users get an impoverished impression of AI does not bode well for these people’s future.
Meta Movie Gen: Video + Sound Generation
Facebook (sorry, Meta) has published a long paper (92 pages) about its new “Movie Gen” generative AI model for making video, images, and audio. This model is not available for the public to use, so we have to take their word for it.
One part of the paper (Sections 3.5 and 3.6) is of interest to user experience folks because it describes their methodology for evaluating the quality of the videos (and images, though that’s less interesting).
Meta created a prompt library of 1,000 sample prompts that they used to generate test videos, both with their own model and with the main competing models. They then had human evaluators watch pairs of videos and rate which one they liked best. The raters were asked to judge specific aspects of the videos:
Prompt alignment: how well the video shows what the prompt requested.
Frame consistency: the temporal consistency of the generated content, focusing on object relationships and smoothness of movement.
Motion completeness: whether the video contains sufficient motion, especially for challenging or unusual prompts.
Motion naturalness: the realism of motion, ensuring natural and physically accurate movements.
Realness: which video more closely resembles real footage, even for fantastical content.
Aesthetics: the visual appeal of the generated video, including content, lighting, color, and style.
Overall quality: a holistic judgment of which video is better, based on the metrics above.
Meta ran head-to-head comparisons with other models and achieved the following net win rates for overall quality:
Kling 1.5: 3.9%
Sora (OpenAI unreleased model): 8.2%
Runway Gen3: 35.0%
Luma Dream Machine: 60.6%
The net win rate is the percentage of times Meta’s video won minus the percentage of times the competing model’s video won. For example, a net win rate of 10% would indicate that Meta was best 55% of the time and the competitor was best 45% of the time.
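If we assume every comparison produced a clear winner (the paper may also count ties), the net win rate converts back into raw win shares with a bit of arithmetic. A minimal sketch of that conversion:

```python
def shares_from_net_win_rate(net: float) -> tuple[float, float]:
    # With no ties, the two win shares sum to 1 and their difference is the net win rate,
    # so the winner gets (1 + net) / 2 and the loser gets (1 - net) / 2.
    return (1 + net) / 2, (1 - net) / 2

print(shares_from_net_win_rate(0.10))   # (0.55, 0.45): the worked example above
print(shares_from_net_win_rate(0.039))  # ~(0.52, 0.48): Movie Gen vs. Kling 1.5 was close to a coin flip
print(shares_from_net_win_rate(0.606))  # ~(0.80, 0.20): Movie Gen vs. Luma Dream Machine was a rout
```

In other words, a 3.9% net win rate means the raters were nearly split, while a 60.6% net win rate means one model won about four comparisons out of five.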
Meta had pairs of AI video models cross swords to see how often each one would win in generating the best video for each of 1,000 prompts. (Midjourney)
Since Meta’s video model is not available to the public, its video quality is fairly irrelevant, but the data shows that we ought to clamor for it to be released because it seems to be better than all the others, winning all the contests.
Of the video models we can actually use, the Chinese video model Kling 1.5 won big time. This aligns with my personal experience that Kling currently generates the best videos overall, as mentioned below. (Sometimes other models are better: for example, I used Luma for the video of flowing lava I made to promote this newsletter on LinkedIn.)
Sora’s performance is likewise irrelevant, since it is also restricted to in-house use at OpenAI. Furthermore, I think its quality is exaggerated in this study, since the Meta team was restricted to comparing their own model with the sample videos OpenAI has published for the public to gawk at. Even though OpenAI has an almost incompetent marketing department, I still think they are smart enough to publish the best results of their product and have a monster eat any flubs so that they are never seen. Thus, the Meta contest only tested the best of Sora’s videos, not its average videos.
AI-Created Podcast
One way we creators can broaden the exposure of our work is by repurposing it in alternate media forms. Google’s NotebookLM creates very dynamic podcasts based on text files.
I made a podcast based on last week’s UX Roundup newsletter. (YouTube, 9 min. video.)
NotebookLM only made the audio. I made a drawing of the podcast hosts in Midjourney and animated it with Kling 1.5 to make images for the video version of the podcast.
The Chinese AI video service Kling currently produces the highest-quality animations, but it eats credits: I animated two podcasts (the second will drop shortly), and those two small projects consumed a full month’s worth of credits at the lowest-cost subscription. (As an aside, I’m looking forward to the rumored release of another Chinese video model, Seaweed, in about a month. It can supposedly generate 2-minute-long segments. That’ll make video production more manageable for this kind of animated podcast.)
Why are Chinese AI video generators better than the American equivalents? Speculation abounds in the creator community, but my guess is simply that they have access to enormous amounts of training data in the form of short-format footage from traditional video sites like Youku, Bilibili, and TikTok/Douyin, which are hugely popular in China.
Two podcast hosts discussed my newsletter from last week. They liked it. But then they’re AI, so they are supposed to like me — this should have been Asimov’s fourth law of robotics. (Midjourney)
Translated Video Subtitles
Subtitles are an important accessibility feature for video. Luckily, subtitles are easy to generate with AI: I use TurboScribe, which is more accurate than YouTube’s built-in auto-generated subtitles.
Once you have subtitles for a video’s native language, you can make AI translations with less than a minute’s work for each additional language. For example, my video on The 4 AI Metaphors: AI as Intern, Coworker, Teacher, Coach has subtitles in 10 languages: Arabic, Chinese (simplified), English (the language spoken in the video), Danish, French, German, Hindi, Japanese, Korean, and Spanish.
The translations aren’t perfect, which is one reason I made a version in my native language, Danish. A Danish version isn’t really needed, since almost all Danes understand English, but it let me appreciate the nuances of the translation better than I could in, say, Japanese, where I only know the word はい (hai).
While not super-elegant, having translated captions for the main regions you serve makes your content accessible to a much broader audience. Current AI quality gets the point across. Next-generation AI (expected around December 2024) will almost certainly step up its international game and achieve the elegant phrasing you deserve.
Do you speak any of the other languages? If so, let me know in the comments what you think of the translation quality.
I used Claude 3.5 Sonnet with this prompt: “I have uploaded a text file in SRT format with video captions in English. Please translate the English text in this file into Danish and output in SRT format, using the same time codes as in the English version.”
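For readers who haven’t worked with subtitle files: SRT is a plain-text format in which each numbered cue consists of a time-code line followed by the caption text, so translation only needs to change the text while preserving the cue numbers and time codes. A made-up cue (not from my actual video) to illustrate, first in English and then in Danish:

```
1
00:00:01,000 --> 00:00:04,200
Welcome to this video about the four AI metaphors.

1
00:00:01,000 --> 00:00:04,200
Velkommen til denne video om de fire AI-metaforer.
```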
I wouldn’t claim that adding translated subtitles to your videos will bring about world peace. But multilingual subtitles are easy to produce with AI and will bring your content to a larger, worldwide audience. (Midjourney)
Fast-Tracking AI Adoption
The rapid rise of generative AI is nothing short of astonishing. In August 2024, a new study revealed that 39% of the working-age population in the United States had used generative AI, with 32% using it at least once during the week of the survey, and an additional 7% classified as infrequent users. This trend is not merely about casual experimentation; it represents the growing impact of AI on everyday professional activities.
Alexander Bick (Federal Reserve Bank of St. Louis), Adam Blandin (Vanderbilt University), and David J. Deming (Harvard) used the Real-Time Population Survey to question 5,014 Americans aged 18-64 about their AI use.
The specific percentage of people using AI may not be that important, since we’re still early in the game, with only primitive AI available to users outside the AI labs. Also, the general population includes many people who are not knowledge workers and may not benefit that much from current AI. (Things will be different after the next generation of AI hits the street in late 2024 or early 2025, and especially when the Ph.D.-level AI is released around 2027 and superintelligent AI around 2030.)
It's more interesting to compare the uptake of AI with that of previous technology revolutions. Decent AI is now about 2 years old. In contrast, when the consumer Internet was 2 years old, only 20% of Americans were Internet users. For personal computers it took 3 years to reach 20% penetration. (The authors count commercial PC use as starting in 1981 with the launch of the IBM PC and commercial Internet use as starting in 1995 when the National Science Foundation stopped running the Internet and started allowing commercial traffic on the network.)
AI adoption is occurring at twice the speed of the Internet and more than double the rate of PCs. This confirms what I have said repeatedly: AI will have a much stronger impact on the world than the Internet did. In fact, I compare the AI Revolution with the Industrial Revolution, not the Internet or the PC, important as they were.
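As a quick back-of-the-envelope check on the “twice the speed” claim, using the figures quoted above (a rough sketch; the study’s own comparison is more careful):

```python
# Adoption shares quoted above.
ai_share_at_age_2 = 0.39        # U.S. working-age adults who had used generative AI, August 2024
internet_share_at_age_2 = 0.20  # Americans online when the consumer Internet was 2 years old

print(ai_share_at_age_2 / internet_share_at_age_2)  # ~1.95, i.e., roughly twice the Internet's pace
# PCs needed 3 years to reach 20% penetration; AI was at roughly double that share at age 2.
```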
AI is advancing faster than previous tech revolutions did. (Midjourney)
Similar results show strong adoption of AI among knowledge workers in Denmark. Anders Humlum (University of Chicago) and Emilie Vestergaard (University of Copenhagen) surveyed 100,000 Danish professionals between November 2023 and January 2024 from professions such as accountants, customer support specialists, financial advisors, HR professionals, IT support specialists, journalists, legal professionals, marketing professionals, office clerks, software developers, and teachers.
Half of these Danish professionals had used ChatGPT, with adoption rates ranging from 79% for software developers to 34% for financial advisors.
There are some unfortunate differences between demographic groups: most strikingly, women had a 20-percentage-point lower adoption of AI than men. Education was a positive factor, but less so than I would have expected, with each year of schooling adding 0.1 percentage points to AI adoption. Thus, the difference between people with a high school diploma and a master’s degree (roughly 6 extra years of schooling) is less than 1 percentage point. Finally, age was a negative factor, with each year of age subtracting one percentage point from AI adoption. However, older users might actually stand to benefit the most from AI.
Generative AI's unique strength lies in ideation — something that could help offset the decline in fluid intelligence associated with aging. In this context, AI serves as a creativity booster, enabling older professionals to generate new ideas and innovate at levels that might otherwise become increasingly challenging.
Young people use AI much more than old people do. (Ideogram)
AI Retention Strong
It’s not just AI adoption that moves fast. Retention is also strong: people tend to keep going once they start using AI. Ramp (a purveyor of corporate credit cards) reports that as of Q2 2024, company spending on AI (charged to their cards) had increased by a whopping 375% compared to a year earlier. Again, that’s fast adoption. More importantly, of the companies that bought an AI product, 70% are still paying for it a year later.
AI is sticky for businesses. Once a company starts using an AI product, it keeps paying. (Midjourney)
In another example, Stripe (which processes credit-card charges for merchants, so basically the opposite side of the transaction from Ramp) analyzed its top 100 AI clients now versus its top 100 software-as-a-service (SaaS) clients when that technology wave was at its peak in 2018. Stripe found that the average top-100 AI vendor reached $5M in charges in about 24 months, whereas it had taken the top SaaS companies 46 months to reach that same level of customer fees.
New Version 1.1 of Flux Image Generator
The German product Flux is currently one of the best AI image-generation models. An upgraded version 1.1 was released last week. The creator community is mainly happy with the upgrade, which seems to produce better-looking images overall.
Flux Pro 1.1 currently tops the AI Leaderboard for Image Generation, with an Elo score of 1153, compared to 1108 for Ideogram 2 and 1100 for Midjourney 6.1. These scores are biased by the question used to collect user feedback, which asks “Which image best reflects this prompt?” This phrasing prioritizes prompt adherence (where Flux and Ideogram shine) over image quality and stylistic variations (where Midjourney shines; I do hope MJ will prioritize prompt adherence in version 7). Dall-E 3 trails far behind with a 1005 Elo: it’s a disgrace that OpenAI hasn’t updated its image model for so long.
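For a sense of what these Elo gaps mean, the standard Elo formula converts a rating difference into an expected head-to-head win rate. A minimal sketch, assuming the leaderboard uses the conventional logistic Elo model with a 400-point scale (I haven’t verified its exact rating method):

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    # Expected share of pairwise votes for A under the standard Elo model.
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

print(elo_win_probability(1153, 1100))  # Flux 1.1 vs. Midjourney 6.1: ~0.58
print(elo_win_probability(1153, 1005))  # Flux 1.1 vs. Dall-E 3:       ~0.70
```

Under that assumption, even the top-to-bottom gap on the leaderboard corresponds to winning roughly 7 out of 10 pairwise votes, not a blowout.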
However, when I tried my test prompt to make images of a 3-member K-Pop group dancing in a TV production, I didn’t think much of Flux’s 1.1 upgrade.
Comparing Flux Pro version 1.0 (left) with version 1.1 (right). The top row asked for a color pencil drawing, whereas the bottom row asked for a photorealistic image.
As you can see, 3 of the 4 images had sufficient prompt adherence to limit the group to 3 dancers, even though 4-member groups are much more common in K-Pop and thus will be vastly better represented in the training data. However, the one failure was in the supposedly better version 1.1.
For the photoreal images, I prefer version 1.0. I did ask for a wide view, which may have induced version 1.1 to letterbox the image to make it look more like a widescreen movie (even though I asked for a TV production). Depending on how you feel about “wide view,” you could argue that this was more prompt adherent.
Neither of the two supposed drawings looks much like a drawing, though I think version 1.1 made the stage set look more like a drawing (its 4 dancers look photorealistic).
In defense of Flux, I should say that I wrote this prompt to judge Midjourney’s image quality, so it’s not optimized for Flux’s prompt interpretation. That said, I don’t think users should be expected to know the intricacies of each AI model to use it.
From this very limited experiment with two styles of a single prompt, my conclusion is that I’m unimpressed with Flux version 1.1.
To compare Flux with Midjourney, see my results from these exact same prompts in Midjourney 6.1 two months ago, or my video reviewing Midjourney’s image quality since version 1. I think Midjourney remains the best for a wider range of styles than pure photorealism. (Of course, we’re all waiting for Midjourney 7 to drop in the hope that it’ll be even better. Any day now.)
About the Author
Jakob Nielsen, Ph.D., is a usability pioneer with 41 years of experience in UX and the Founder of UX Tigers. He founded the discount usability movement for fast and cheap iterative design, including heuristic evaluation and the 10 usability heuristics. He formulated the eponymous Jakob’s Law of the Internet User Experience. He has been named “the king of usability” by Internet Magazine, “the guru of Web page usability” by The New York Times, and “the next best thing to a true time machine” by USA Today.
Previously, Dr. Nielsen was a Sun Microsystems Distinguished Engineer and a Member of Research Staff at Bell Communications Research, the branch of Bell Labs owned by the Regional Bell Operating Companies. He is the author of 8 books, including the best-selling Designing Web Usability: The Practice of Simplicity (published in 22 languages), the foundational Usability Engineering (27,558 citations in Google Scholar), and the pioneering Hypertext and Hypermedia (published two years before the Web launched).
Dr. Nielsen holds 79 United States patents, mainly on making the Internet easier to use. He received the Lifetime Achievement Award for Human–Computer Interaction Practice from ACM SIGCHI and was named a “Titan of Human Factors” by the Human Factors and Ergonomics Society.
· Subscribe to Jakob’s newsletter to get the full text of new articles emailed to you as soon as they are published.
· Read: article about Jakob Nielsen’s career in UX
· Watch: Jakob Nielsen’s 41 years in UX (8 min. video)