This week the NY Times somehow broke the story of… well, the NY Times suing OpenAI and Microsoft. I wonder who tipped them off. Anyhoo, the lawsuit in many ways is similar to some of the over a dozen lawsuits filed by copyright holders against AI companies. We’ve written about how silly many of these lawsuits are, in that they appear to be written by people who don’t much understand copyright law. And, as we noted, even if courts actually decide in favor of the copyright holders, it’s not like it will turn into any major windfall. All it will do is create another corruptible collection point, while locking in only a few large AI companies who can afford to pay up.
I’ve seen some people arguing that the NY Times lawsuit is somehow “stronger” and more effective than the others, but I honestly don’t see that. Indeed, the NY Times itself seems to think its case is so similar to the ridiculously bad Authors Guild case, that it’s looking to combine the cases.
But while there are some unique aspects to the NY Times case, I’m not sure they are nearly as compelling as the NY Times and its supporters think they are. Indeed, I think if the Times actually wins its case, it would open the Times itself up to some fairly damning lawsuits, given its somewhat infamous journalistic practice of summarizing other people’s articles without credit. But, we’ll get there.
The Times, in typical NY Times fashion, presents this case as though the NY Times is the great defender of press freedom, taking this stand to stop the evil interlopers of AI.
Independent journalism is vital to our democracy. It is also increasingly rare and valuable. For more than 170 years, The Times has given the world deeply reported, expert, independent journalism. Times journalists go where the story is, often at great risk and cost, to inform the public about important and pressing issues. They bear witness to conflict and disasters, provide accountability for the use of power, and illuminate truths that would otherwise go unseen. Their essential work is made possible through the efforts of a large and expensive organization that provides legal, security, and operational support, as well as editors who ensure their journalism meets the highest standards of accuracy and fairness. This work has always been important. But within a damaged information ecosystem that is awash in unreliable content, The Times’s journalism provides a service that has grown even more valuable to the public by supplying trustworthy information, news analysis, and commentary.
Defendants’ unlawful use of The Times’s work to create artificial intelligence products that compete with it threatens The Times’s ability to provide that service. Defendants’ generative artificial intelligence (“GenAI”) tools rely on large-language models (“LLMs”) that were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more. While Defendants engaged in widescale copying from many sources, they gave Times content particular emphasis when building their LLMs—revealing a preference that recognizes the value of those works. Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment.
As the lawsuit makes clear, this isn’t some high and mighty fight for journalism. It’s a negotiating ploy. The Times admits that it has been trying to get OpenAI to cough up some cash for its training:
For months, The Times has attempted to reach a negotiated agreement with Defendants, in accordance with its history of working productively with large technology platforms to permit the use of its content in new digital products (including the news products developed by Google, Meta, and Apple). The Times’s goal during these negotiations was to ensure it received fair value for the use of its content, facilitate the continuation of a healthy news ecosystem, and help develop GenAI technology in a responsible way that benefits society and supports a well-informed public.
I’m guessing that OpenAI’s decision a few weeks back to pay off media giant Axel Springer to avoid one of these lawsuits, and the failure to negotiate a similar deal (at what is likely a much higher price), resulted in the Times moving forward with the lawsuit.
There are five or six whole pages of puffery about how amazing the NY Times thinks the NY Times is, followed by the laughably stupid claim that generative AI “threatens” the kind of journalism the NY Times produces.
Let me let you in on a little secret: if you think that generative AI can do serious journalism better than a massive organization with a huge number of reporters, then, um, you deserve to go out of business. For all the puffery about the amazing work of the NY Times, this seems to suggest that it can easily be replaced by an auto-complete machine.
In the end, though, the crux of this lawsuit is the same as all the others. It’s a false belief that reading something (whether by human or machine) somehow implicates copyright. This is false. If the courts (or the legislature) decide otherwise, it would upset pretty much all of the history of copyright and create some significant real world problems.
Part of the Times complaint is that OpenAI’s GPT LLM was trained in part with Common Crawl data. Common Crawl is an incredibly useful and important resource that apparently is now coming under attack. It has been building an open repository of the web for people to use, not unlike the Internet Archive, but with a focus on making it accessible to researchers and innovators. Common Crawl is a fantastic resource run by some great people (though the lawsuit here attacks them).
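To get a sense of how open this resource is: anyone can look up what Common Crawl has captured for a given site through its public CDX index API. A rough sketch (the crawl ID below is just one example; current IDs are listed at index.commoncrawl.org):

```bash
# List Common Crawl captures of a URL from one crawl's CDX index.
# CC-MAIN-2023-50 is an example crawl ID; pick a current one from index.commoncrawl.org.
curl -s 'https://index.commoncrawl.org/CC-MAIN-2023-50-index?url=nytimes.com/&output=json' | head
```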
But, again, this is the nature of the internet. It’s why things like Google’s cache and the Internet Archive’s Wayback Machine are so important. These are archives of history that are incredibly important, and have historically been protected by fair use, which the Times is now threatening.
Either way, so much of the lawsuit is claiming that GPT learning from this data is infringement. And, as we’ve noted repeatedly, reading/processing data is not a right limited by copyright. We’ve already seen this in multiple lawsuits, but this rush of plaintiffs is hoping that maybe judges will be wowed by this newfangled “generative AI” technology into ignoring the basics of copyright law and pretending that there are now rights that simply do not exist.
Now, the one element that appears different in the Times’ lawsuit is that it has a bunch of exhibits that purport to prove how GPT regurgitates Times articles. Exhibit J is getting plenty of attention here, as the NY Times demonstrates how it was able to prompt ChatGPT in such a manner that it basically provided them with direct copies of NY Times articles.
In the complaint, they show this:
At first glance that might look damning. But it’s a lot less damning when you look at the actual prompt in Exhibit J and realize what happened, and how generative AI actually works.
What the Times did is prompt GPT-4 by (1) giving it the URL of the story and then (2) “prompting” it by giving it the headline of the article and the first seven and a half paragraphs of the article, and asking it to continue.
Here’s how the Times describes this:
Each example focuses on a single news article. Examples were produced by breaking the article into two parts. The first part of the article is given to GPT-4, and GPT-4 replies by writing its own version of the remainder of the article.
Here’s how it appears in Exhibit J (notably, the prompt was left out of the complaint itself):
If you actually understand how these systems work, the output looking very similar to the original NY Times piece is not so surprising. When you prompt a generative AI system like GPT, you’re giving it a bunch of parameters, which act as conditions and limits on its output. From those constraints, it’s trying to generate the most likely next part of the response. But, by providing it paragraphs upon paragraphs of these articles, the NY Times has effectively constrained GPT to the point that the most probable response is… very close to the NY Times’ original story.
In other words, by constraining GPT to effectively “recreate this article,” GPT has a very small data set to work off of, meaning that the highest likelihood outcome is going to sound remarkably like the original. If you were to create a much shorter prompt, or introduce further randomness into the process, you’d get a much more random output. But these kinds of prompts effectively tell GPT not to do anything BUT write the same article.
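To make that concrete, here is roughly what such a probe looks like against the standard OpenAI chat completions API. This is a sketch of the general technique, not the Times’ actual methodology: the placeholder text stands in for the headline and opening paragraphs, and the temperature setting here is an assumption.

```bash
# Hand the model a long verbatim prefix of an article and ask it to continue.
# With that much original text as the constraint (and temperature 0), the
# highest-probability continuation tends to track the original very closely.
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "temperature": 0,
    "messages": [
      {"role": "user",
       "content": "Continue this article:\n\n<headline>\n\n<first several paragraphs, verbatim>"}
    ]
  }'
```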
From there, though, the lawsuit gets dumber.
It shows that you can sorta get around the NY Times’ paywall in the most inefficient and unreliable way possible by asking ChatGPT to quote the first few paragraphs in one-paragraph chunks.
Of course, quoting individual paragraphs from a news article is almost certainly fair use. And, for what it’s worth, the Times itself admits that this process doesn’t actually return the full article, but a paraphrase of it.
And the lawsuit seems to suggest that merely summarizing articles is itself infringing:
That’s… all factual information summarizing the review? And the complaint shows that if you then ask for (again, paragraph-length) quotes, GPT will give you a few quotes from the article.
And, yes, the complaint literally argues that a generative AI tool can violate copyright when it “summarizes” an article.
The issue here is not so much how GPT is trained, but how the NY Times is constraining the output. That is unrelated to the question of whether the reading of these articles is fair use. The purpose of these LLMs is not to repeat the content that is scanned, but to figure out the most probable next token for a given prompt. When the Times constrains the prompts in such a way that the data set is basically one article and one article only… well… that’s what you get.
Elsewhere, the Times again complains about GPT returning factual information that is not subject to copyright law.
But, I mean, if you were to ask anyone the same question, “What does wirecutter recommend for The Best Kitchen Scale,” they’re likely to return you a similar result, and that’s not infringing. It’s a fact that that scale is the one that it recommends. The Times complains that people who do this prompt will avoid clicking on Wirecutter affiliate links, but… um… it has no right to that affiliate income.
I mean, I’ll admit right here that I often research products and look at Wirecutter (and other!) reviews before eventually shopping independently of that research. In other words, I will frequently buy products after reading the recommendations on Wirecutter, but without clicking on an affiliate link. Is the NY Times really trying to suggest that this violates its copyright? Because that’s crazy.
Meanwhile, it’s not clear if the NY Times is mad that it’s accurately recommending stuff or if it’s just… mad. Because later in the complaint, the NY Times says it’s bad that sometimes GPT recommends the wrong product or makes up a paragraph.
So… the complaint is both that GPT reproduces things too accurately, AND not accurately enough. Which is it?
Anyway, the larger point is that if the NY Times wins, well… the NY Times might find itself on the receiving end of some lawsuits. The NY Times is somewhat infamous in the news world for using other journalists’ work as a starting point and building off of it (frequently without any credit at all). Sometimes this results in an eventual correction, but often it does not.
If the NY Times successfully argues that reading a third party article to help its reporters “learn” about the news before reporting their own version of it is copyright infringement, it might not like how that is turned around by tons of other news organizations against the NY Times. Because I don’t see how there’s any legitimate distinction between OpenAI scanning NY Times articles and NY Times reporters scanning other articles/books/research without first licensing those works as well.
Or, say, what happens if a source for a NY Times reporter provides them with some copyright-covered work (an article, a book, a photograph, who knows what) that the NY Times does not have a license for? Can the NY Times journalist then produce an article based on that material (along with other research, though much less than OpenAI used in training GPT)?
It seems like (and this happens all too often in the news industry) the NY Times is arguing that it’s okay for its journalists to do this kind of thing because it’s in the business of producing Important Journalism™ whereas anyone else doing the same thing is some damn interloper.
We see this with other copyright disputes involving the media industry, and with the ridiculous fight over the hot news doctrine, in which news orgs claimed that they should be the only ones allowed to report on something for a while.
Similarly, I’ll note that even if the NY Times gets some money out of this, don’t expect the actual reporters to see any of it. Remember, this is the same NY Times that once tried to stiff freelance reporters by relicensing their articles to electronic databases without paying them. The Supreme Court didn’t like that. If the NY Times establishes that merely training AI on old articles is a licenseable, copyright-impacting event, will it go back and pay those reporters a piece of whatever change they get? Or nah?
Two electric vehicle (EV) models powered by sodium-ion batteries have rolled off the production line in China, signaling that the new, lower-cost batteries are closer to being used on a large scale.
A model powered by sodium-ion batteries built by Farasis Energy in partnership with JMEV, an EV brand owned by Jiangling Motors Group, rolled off the assembly line on December 28, according to the battery maker.
The model, based on JMEV’s EV3, has a range of 251 km and is the first all-electric A00-class model powered by sodium-ion batteries to be built by Farasis Energy in collaboration with JMEV.
The JMEV EV3 is a compact, all-electric vehicle with a CLTC range of 301 km and a battery pack capacity of 31.15 kWh for its two lithium-ion battery versions. The starting prices for these two versions are RMB 62,800 ($8,840) and RMB 66,800, respectively.
The model’s sodium battery version starts at RMB 58,800, with a battery pack capacity of 21.4 kWh and a CLTC range of 251 km, according to its specification sheet.
Farasis Energy’s sodium-ion batteries currently in production have energy densities in the range of 140-160 Wh/kg, and the battery cells have passed tests including pin-prick, overcharging, and extrusion, according to the company.
Farasis Energy will launch the second generation of sodium-ion batteries in 2024 with an energy density of 160-180 Wh/kg, it said.
By 2026, the next generation of sodium-ion battery products will have an energy density of 180-200 Wh/kg.
On December 27, battery maker Hina Battery announced that a model powered by sodium-ion batteries, which it jointly built with Anhui Jianghuai Automobile Group Corp (JAC), rolled off the production line.
The model is a new variant of the Yiwei 3, the first model under JAC’s new Yiwei brand, and utilizes Hina Battery’s sodium-ion cylindrical cells.
(Image credit: Hina Battery)
Volume deliveries of the sodium-ion battery-equipped Yiwei model are expected to begin in January 2024, according to Hina Battery.
Hina Battery and Sehol — a joint venture brand between JAC and Volkswagen Anhui — would jointly build a test vehicle with sodium-ion batteries based on the latter’s Sehol E10X model, according to a statement in February.
The test vehicle’s battery pack has a capacity of 25 kWh and an energy density of 120 Wh/kg. The model has a range of 252 km and supports 3C to 4C fast charging. The battery pack uses cells with an energy density of 140 Wh/kg.
JAC launched its new brand Yiwei (钇为 in Chinese) on April 12 and made the brand’s first model, the Yiwei 3, available on June 16.
According to information released yesterday by Hina Battery, the two are working together to build a production vehicle powered by sodium-ion batteries based on the Yiwei 3.
We all have a folder full of images whose filenames resemble line noise. How about renaming those images with the help of a local LLM (large language model) executable on the command line? All that and more is showcased on [Justine Tunney]’s bash one-liners for LLMs, a showcase aimed at giving folks ideas and guidance on using a local (and private) LLM to do actual, useful work.
This is built out from the recent llamafile project, which turns LLMs into single-file executables. This not only makes them more portable and easier to distribute, but the executables are perfectly capable of being called from the command line and writing to standard output like any other UNIX tool. It’s simpler to version control the embedded LLM weights (and therefore their behavior) when it’s all part of the same file as well.
One such tool (the multi-modal LLaVA) is capable of interpreting image content. As an example, we can point it to a local image of the Jolly Wrencher logo using the following command:
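The invocation is roughly of this shape; this is a sketch assuming a locally downloaded LLaVA llamafile and the llama.cpp-style flags described in the llamafile documentation, with placeholder filenames:

```bash
# Describe a local image with a LLaVA llamafile, deterministically (--temp 0).
./llava-v1.5-7b-q4.llamafile --temp 0 \
    --image jolly-wrencher.jpg \
    -e -p '### User: The image has\n### Assistant:'
```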
The image has a black background with a white skull and crossbones symbol.
With a different prompt (“What do you see?” instead of “The image has…”) the LLM even picks out the wrenches, but one can already see that the right pieces exist to do some useful work.
Check out [Justine]’s rename-pictures.sh script, which cleverly evaluates image filenames. If an image’s given filename already looks like readable English (also a job for a local LLM) the image is left alone. Otherwise, the picture is fed to an LLM whose output guides the generation of a new short and descriptive English filename in lowercase, with underscores for spaces.
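The core idea can be sketched in a few lines of shell. This is a drastically simplified illustration rather than [Justine]’s actual script: it skips the check for filenames that are already readable English, and the llamafile name is a placeholder.

```bash
#!/bin/sh
# For each JPEG: ask a local LLaVA llamafile for a short description,
# then turn that description into a lowercase, underscore-separated filename.
for img in *.jpg; do
  desc=$(./llava-v1.5-7b-q4.llamafile --temp 0 --image "$img" \
           -e -p '### User: Describe this image in a few words.\n### Assistant:')
  name=$(printf '%s' "$desc" | tr '[:upper:]' '[:lower:]' \
           | tr -cs 'a-z0-9' '_' | sed 's/^_*//; s/_*$//')
  [ -n "$name" ] && mv -n -- "$img" "${name}.jpg"
done
```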
What about the fact that LLM output isn’t entirely predictable? That’s easy to deal with. [Justine] suggests always calling these tools with the --temp 0 parameter. Setting the temperature to zero makes the model deterministic, ensuring that the same input always yields the same output.
Among sportspeople and military vets, traumatic brain injury (TBI) is one of the major causes of permanent disability and death. Injury statistics show that the majority of TBIs, of which concussion is a subtype, are associated with oblique impacts, which subject the brain to a combination of linear and rotational kinetic energy forces and cause shearing of the delicate brain tissue.
To improve their effectiveness, helmets worn by military personnel and sportspeople must employ a liner material that limits both. This is where researchers from the University of Wisconsin-Madison come in. Determined to prevent – or lessen the effect of – TBIs caused by knocks to the body and head, they’ve developed a new lightweight foam material for use as a helmet liner.
[…]
For the current study, Thevamaran built upon his previous research into vertically aligned carbon nanotube (VACNT) foams – carefully arranged layers of carbon cylinders one atom thick – and their exceptional shock-absorbing capabilities. Current helmets attempt to reduce rotational motion by allowing a sliding motion between the wearer’s head and the helmet during impact. However, the researchers say this movement doesn’t dissipate energy in shear and can jam when severely compressed following a blow. Instead, their novel foam doesn’t rely on sliding layers.
(Image caption: Oblique impacts, associated with the majority of TBIs, subject the brain to a combination of linear and rotational shear forces. Credit: Maheswaran et al.)
VACNT foam sidesteps this shortcoming via its unique deformation mechanism. Under compression, the VACNTs undergo collective sequentially progressive buckling, from increased compliance at low shear strain levels to a stiffening response at high strain levels. The formed compression buckles unfold completely, enabling the VACNT foam to accommodate large shear strains before returning to a near initial state when the load is removed.
The researchers found that at 25% precompression, the foam exhibited almost 30 times higher energy dissipation in shear – up to 50% shear strain – than polyurethane-based elastomeric foams of similar density.
Amazon customers already pay $15 per month, or $139 annually for Amazon Prime, which includes a subscription to Amazon’s streaming TV service. In a bid to make Wall Street happy, Amazon recently announced it would start hitting those users with entirely new streaming TV ads, something you can only avoid if you’re willing to shell out an additional $3 a month.
There was ample backlash to Amazon’s plan, but it apparently accomplished nothing. Amazon says it’s moving full steam ahead with the plan, which will begin on January 29th:
“We aim to have meaningfully fewer ads than linear TV and other streaming TV providers. No action is required from you, and there is no change to the current price of your Prime membership,” the company wrote. Customers have the option of paying an additional $2.99 per month to keep avoiding advertisements.
If you recall, it took the cable TV, film, music, and broadcast sectors the better part of two decades before they were willing to give users affordable, online access to their content as part of a broader bid to combat piracy. There was just an endless amount of teeth gnashing by industry executives as they were pulled kicking and screaming into the future.
Despite having just gone through that experience, streaming executives refuse to learn anything from it, and are dead set on nickel and diming their users. This will inevitably drive a non-insignificant amount of those users back to piracy, at which point executives will blame the shift on absolutely everything and anything other than themselves.