Researchers at Endgame, a cyber-security biz based in Virginia, have published EMBER, which they believe is the first large open-source dataset for machine-learning malware detection.
EMBER contains metadata describing 1.1 million Windows portable executable files: 900,000 training samples evenly split into malicious, benign, and unlabelled categories, and 200,000 test samples labelled as either malicious or benign.
“We’re trying to push the dark arts of infosec research into an open light. EMBER will make AI research more transparent and reproducible,” Hyrum Anderson, co-author of the study to be presented at the RSA conference this week in San Francisco, told The Register.
Progress in AI is driven by data. Researchers compete with one another by building models and training them on benchmark datasets to reach ever-increasing accuracy.
Computer vision is flooded with datasets containing millions of annotated pictures for image recognition tasks, and natural language processing has various text-based datasets to test machine reading and comprehension skills. That abundance of data has helped a lot in building AI image processing.
Although there is a strong interest in using AI for information security – look at DARPA’s Cyber Grand Challenge where academics developed software capable of hunting for security bugs autonomously – it’s an area that doesn’t really have any public datasets.