Malware Scanning on Hugging Face

What is ClamAV?

ClamAV® is an open-source antivirus engine for detecting trojans, viruses, malware, and other malicious threats.

Files are scanned rapidly using regex-like signatures that differentiate clean and malicious/unwanted files.

As part of our partnership with Hugging Face, Foundation AI operates custom-fit signatures that detect deserialization risks in common model file formats such as .pt and .pkl, in addition to scanning for traditional malware.

What should I do?

ClamAV scans every file in the repository against patterns that identify malware (known as "signatures"). A single file can match multiple signatures.

Model Malware

If the signature in your alert begins with the string Py.Malware, it is for model-specific malware.

How it Works

Each signature looks for import(s) of a specific Python module within the serialized model file. These modules are considered suspicious in the context of ML models.

For example, the signature Py.Malware.NetAccess_webbrowser_ANY_GLOBAL indicates that the model imports the webbrowser module during deserialization, which constitutes a potential network access threat (NetAccess).
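To make the idea of "an import inside a serialized model file" concrete, here is a minimal sketch that lists the modules a pickle stream references, using Python's standard pickletools module. It only illustrates the concept: ClamAV matches byte patterns and does not run this code, and the helper name find_imports is hypothetical. The stream is parsed opcode by opcode and never unpickled, so nothing in it is executed.

```python
import pickle
import pickletools
import webbrowser


def find_imports(data: bytes):
    """Return (module, name) pairs referenced by import opcodes in a pickle.

    The stream is only parsed, never loaded, so no code in it runs.
    """
    imports = []
    strings = []  # string arguments seen so far, used to resolve STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Older protocols: the argument is "module name" in one string.
            module, name = arg.split(" ", 1)
            imports.append((module, name))
        elif opcode.name == "STACK_GLOBAL":
            # Protocol 4+: module and name were pushed as the two previous
            # strings. Simplification: strings fetched from the memo
            # (e.g. via BINGET) are not tracked here.
            if len(strings) >= 2:
                imports.append((strings[-2], strings[-1]))
        elif isinstance(arg, str):
            strings.append(arg)
    return imports


# Demo payload: a pickle that references webbrowser.open by name.
# It is created and inspected here, but never loaded.
payload = pickle.dumps(webbrowser.open, protocol=4)
print(find_imports(payload))
```

A model file that legitimately needs only tensors and basic containers would normally not reference modules such as webbrowser, which is why such a reference is treated as suspicious.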

After the Py.Malware. prefix, all model malware signature names follow the {risk category}_{module}_{function}_{opcode} naming convention.

Risk Categories

Code Execution (CodeExec): Allows attackers to execute arbitrary code within the target environment, potentially leading to full system compromise, unauthorized access, and data manipulation.
System Access (SysAccess): Allows attackers to execute system-level commands on the host OS, which can result in unauthorized access, privilege escalation, and control over the system.
Network Access (NetAccess): Allows attackers to exploit weaknesses in network communications or remote access, enabling unauthorized access, data interception, or remote system control.
Obfuscation Vulnerabilities (Obfuscation): Allows attackers to hide execution via deserialization processes, potentially leading to arbitrary code execution or unintended actions within the application.
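
The naming convention and the risk categories above can be combined to read an alert programmatically. The following is a rough sketch, not an official parser: the names parse_model_signature and ModelSignature are hypothetical, and it assumes that the function and opcode fields contain no underscores (true for the one example shown here, but not confirmed for every signature).

```python
from typing import NamedTuple

PREFIX = "Py.Malware."

# Abbreviation -> full risk category name, taken from the list above.
RISK_CATEGORIES = {
    "CodeExec": "Code Execution",
    "SysAccess": "System Access",
    "NetAccess": "Network Access",
    "Obfuscation": "Obfuscation Vulnerabilities",
}


class ModelSignature(NamedTuple):
    risk_category: str
    module: str
    function: str
    opcode: str


def parse_model_signature(name: str) -> ModelSignature:
    """Split a model malware signature name into its four fields."""
    if not name.startswith(PREFIX):
        raise ValueError(f"not a model malware signature: {name!r}")
    parts = name[len(PREFIX):].split("_")
    if len(parts) < 4:
        raise ValueError(f"unexpected signature layout: {name!r}")
    # Assumption: only the module field may itself contain underscores.
    category, *module_parts, function, opcode = parts
    return ModelSignature(category, "_".join(module_parts), function, opcode)


sig = parse_model_signature("Py.Malware.NetAccess_webbrowser_ANY_GLOBAL")
print(sig)
print(RISK_CATEGORIES.get(sig.risk_category, "unknown category"))
```

Signatures that do not start with Py.Malware are rejected with a ValueError, since they are standard ClamAV signatures and do not follow this convention.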

Recommendations

- Confirm whether the detected Python module is relevant for the model based on its specified task or architecture.
  - Our signatures are intended to only flag modules which have virtually no justification for usage during model deserialization (e.g., subprocess, requests.post, socketserver), but false positives may occur.
- Verify the file hash with a trusted 3rd party: https://talosintelligence.com/reputation_center
  - This helps confirm whether the file has been flagged by other security vendors and provides additional context about the threat.
  - The file hash is available in the SHA256 field on the Hugging Face file details page.
- For models, look for a safetensors version.
  - If one does not exist, ask the provider to release one!
- For other filetypes, determine 1) whether there is a true pickle deserialization risk and 2) whether it is appropriate for this type of file to contain the detected Python module(s).
  - Suppose an alert flags for "system access via os.environ" in a .jsonl dataset containing code snippet data.
  - Because the file is not pickle-based, it's unlikely that there is a pickle deserialization risk. Furthermore, because the dataset intentionally contains code snippets, it's reasonable for it to contain instances of "os.environ" (though inspecting the actual references is always a good measure).
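
If you already have a local copy of the flagged file, you can compute its SHA-256 yourself and compare it with the SHA256 field mentioned above before looking it up. This is a minimal sketch using Python's standard hashlib; the helper name sha256_of_file and the demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks.

    Chunked reading keeps memory use flat even for multi-gigabyte files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file; in practice, pass the path of the flagged file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example contents")
    demo_path = tmp.name

print(sha256_of_file(demo_path))
os.remove(demo_path)
```

Hashing only reads the file's bytes. It does not deserialize or execute anything, so it is safe to run on a file you do not trust.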

If this is your model and you believe the flag is a false positive, please reach out to aiscrm-fdtn-ai-support@cisco.com.

Other Malware

If the signature in your alert does not begin with the string Py.Malware, it is a standard ClamAV malware signature.

ClamAV has been detecting malware since 2002! Millions of unique signatures are available out of the box, detecting trojans, botnets, crypto miners, loaders, and other malware.

Recommendations

- Verify the file hash with a trusted 3rd party: https://talosintelligence.com/reputation_center
  - This helps confirm whether the file has been flagged by other security vendors and provides additional context about the threat.
  - The file hash is available in the SHA256 field on the Hugging Face file details page.
- Analyze the detected signature via sigtool.
  - Each ClamAV signature is designed to detect a specific malware pattern within a file's contents.
  - You can inspect the underlying pattern by decoding the signature with sigtool.
  - For example, if the scanner in Hugging Face detected a virus called Eicar-Signature (a known test file in malware detection), we could decode that signature like so:
```
sigtool --find-sigs Eicar-Signature | sigtool --decode
```
```
VIRUS NAME: Eicar-Signature.
TDB: Engine:56-255,Target:0
LOGICAL EXPRESSION: 0
* SUBSIG ID 0
+-> OFFSET: 0
+-> SIGMOD: NONE
+-> DECODED SUBSIGNATURE:
X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*
```
The final line contains the actual string that was detected in the file (X5O ... +H*).
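
A decoding step is needed because ClamAV stores content patterns as hexadecimal strings. For a pattern with no wildcards, the core of that step can be sketched in a few lines of Python. This is only an illustration under that assumption, not a replacement for sigtool: the helper name decode_plain_hex is hypothetical, and real signatures may contain wildcards and other syntax that it does not handle.

```python
def decode_plain_hex(subsig_hex: str) -> str:
    """Decode a wildcard-free hex pattern into readable text.

    bytes.fromhex raises ValueError on anything that is not plain hex,
    such as wildcard syntax. Non-ASCII bytes are shown as a replacement
    character.
    """
    raw = bytes.fromhex(subsig_hex)
    return raw.decode("ascii", errors="replace")


# Hex for the first five characters of the decoded subsignature above.
print(decode_plain_hex("58354f2150"))  # prints X5O!P
```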