
by michael-laffin

file-deduplicator – OpenClaw Skill

file-deduplicator is an OpenClaw Skills integration for coding workflows. Find and remove duplicate files intelligently. Save storage space, keep your system clean. Perfect for digital hoarders and document management.

8.4k stars | 9.0k forks | Security L1
Updated Feb 7, 2026 | Created Feb 7, 2026 | coding

Skill Snapshot

name: file-deduplicator
description: Find and remove duplicate files intelligently. Save storage space, keep your system clean. Perfect for digital hoarders and document management. OpenClaw Skills integration.
owner: michael-laffin
repository: michael-laffin/file-deduplicator
language: Markdown
license: MIT
topics:
security: L1
install: openclaw add @michael-laffin/file-deduplicator
last updated: Feb 7, 2026

Maintainer

michael-laffin

Maintains file-deduplicator in the OpenClaw Skills directory.

File Explorer

7 files

  • _meta.json (294 B)
  • config.json (726 B)
  • index.js (17.5 KB)
  • package.json (407 B)
  • README.md (6.6 KB)
  • SKILL.md (11.0 KB)
  • test.js (2.5 KB)

SKILL.md

name: file-deduplicator
description: Find and remove duplicate files intelligently. Save storage space, keep your system clean. Perfect for digital hoarders and document management.
metadata: { "openclaw": { "version": "1.0.0", "author": "Vernox", "license": "MIT", "tags": ["deduplication", "storage", "cleanup", "file-management", "duplicate", "disk-space"], "category": "tools" } }

File-Deduplicator - Find and Remove Duplicates

Vernox Utility Skill - Clean up your digital hoard.

Overview

File-Deduplicator is an intelligent duplicate file finder and remover. It uses content hashing to identify identical files across directories, then offers options for removing duplicates safely.

Features

✅ Duplicate Detection

  • Content-based hashing (MD5) for fast comparison
  • Size-based detection (exact match, near match)
  • Name-based detection (similar filenames)
  • Directory scanning (recursive)
  • Exclude patterns (.git, node_modules, etc.)

✅ Removal Options

  • Auto-delete duplicates (keep newest/oldest)
  • Interactive review before deletion
  • Move to archive instead of delete
  • Preserve permissions and metadata
  • Dry-run mode (preview changes)

✅ Analysis Tools

  • Duplicate count summary
  • Space savings estimation
  • Largest duplicate files
  • Most common duplicate patterns
  • Detailed report generation

✅ Safety Features

  • Confirmation prompts before deletion
  • Backup to archive folder
  • Size threshold (don't remove huge files by mistake)
  • Whitelist important directories
  • Undo functionality (log for recovery)

Installation

clawhub install file-deduplicator

Quick Start

Find Duplicates in Directory

const result = await findDuplicates({
  directories: ['./documents', './downloads', './projects'],
  options: {
    method: 'content',  // content-based comparison
    includeSubdirs: true
  }
});

console.log(`Found ${result.duplicateCount} duplicate groups`);
console.log(`Potential space savings: ${result.spaceSaved}`);

Remove Duplicates Automatically

const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',  // keep newest, delete oldest
    action: 'delete',  // or 'move' to archive
    autoConfirm: false  // show confirmation for each
  }
});

console.log(`Removed ${result.filesRemoved} duplicates`);
console.log(`Space saved: ${result.spaceSaved}`);

Dry-Run Preview

const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',
    action: 'delete',
    dryRun: true  // Preview without actual deletion
  }
});

console.log('Would remove:');
result.duplicates.forEach((dup, i) => {
  console.log(`${i+1}. ${dup.file}`);
});

Tool Functions

findDuplicates

Find duplicate files across directories.

Parameters:

  • directories (array|string, required): Directory paths to scan
  • options (object, optional):
    • method (string): 'content' | 'size' | 'name' - comparison method
    • includeSubdirs (boolean): Scan recursively (default: true)
    • minSize (number): Minimum size in bytes (default: 0)
    • maxSize (number): Maximum size in bytes (default: 0, which presumably means no upper limit)
    • excludePatterns (array): Glob patterns to exclude (default: ['.git', 'node_modules'])
    • whitelist (array): Directories to never scan (default: [])
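
The scan parameters above can be pictured with a small directory walker. This is a minimal sketch, not the code in index.js: `walkFiles` and `sizeInRange` are hypothetical helpers, exclusion is a plain name match rather than full glob matching, and treating `maxSize: 0` as "no upper limit" is an assumption.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: list files under `root`, skipping any entry whose
// name appears in `excludeNames` (simple name match, not real globs).
function walkFiles(root, { includeSubdirs = true, excludeNames = ['.git', 'node_modules'] } = {}) {
  const found = [];
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    if (excludeNames.includes(entry.name)) continue;
    const fullPath = path.join(root, entry.name);
    if (entry.isDirectory()) {
      // Only descend when recursive scanning is enabled.
      if (includeSubdirs) found.push(...walkFiles(fullPath, { includeSubdirs, excludeNames }));
    } else if (entry.isFile()) {
      found.push(fullPath);
    }
  }
  return found;
}

// Assumption: maxSize of 0 means "no upper limit".
function sizeInRange(size, minSize = 0, maxSize = 0) {
  return size >= minSize && (maxSize === 0 || size <= maxSize);
}
```

A scan could then keep only the paths for which `sizeInRange(fs.statSync(p).size, minSize, maxSize)` is true.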

Returns:

  • duplicates (array): Array of duplicate groups
  • duplicateCount (number): Number of duplicate groups found
  • totalFiles (number): Total files scanned
  • scanDuration (number): Time taken to scan (ms)
  • spaceWasted (number): Total bytes wasted by duplicates
  • spaceSaved (number): Potential savings if duplicates removed
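
To make the wasted-space figure concrete, here is a minimal sketch of how it could be derived from duplicate groups. The group shape (`size`, `files`) and the `summarizeGroups` helper are assumptions for illustration; the actual result objects may differ.

```javascript
// Hypothetical group shape: { size: bytes per file, files: [paths...] }.
function summarizeGroups(groups) {
  let spaceWasted = 0;
  for (const group of groups) {
    // One copy has to stay, so only the extra copies count as waste.
    spaceWasted += group.size * Math.max(group.files.length - 1, 0);
  }
  return { duplicateCount: groups.length, spaceWasted };
}

const summary = summarizeGroups([
  { size: 1000, files: ['a.pdf', 'copy/a.pdf', 'old/a.pdf'] },
  { size: 250, files: ['b.txt', 'b (1).txt'] },
]);
console.log(summary); // { duplicateCount: 2, spaceWasted: 2250 }
```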

removeDuplicates

Remove duplicate files based on findings.

Parameters:

  • directories (array|string, required): Same as findDuplicates
  • options (object, optional):
    • keep (string): 'newest' | 'oldest' | 'smallest' | 'largest' - which to keep
    • action (string): 'delete' | 'move' | 'archive'
    • archivePath (string): Where to move files when action='move'
    • dryRun (boolean): Preview without actual action
    • autoConfirm (boolean): Auto-confirm deletions
    • sizeThreshold (number): Don't remove files larger than this many bytes

Returns:

  • filesRemoved (number): Number of files removed/moved
  • spaceSaved (number): Bytes saved
  • groupsProcessed (number): Number of duplicate groups handled
  • logPath (string): Path to action log
  • errors (array): Any errors encountered
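
The `keep` and `sizeThreshold` options can be read as a small planning step per duplicate group. The sketch below is illustrative only: the file record shape (`path`, `size`, `mtimeMs`) and the `planGroup` helper are assumptions, not the skill's internal API.

```javascript
// Hypothetical file record: { path, size, mtimeMs }.
// Each ranker sorts the file that should be kept to the front.
const KEEP_RANKERS = {
  newest: (a, b) => b.mtimeMs - a.mtimeMs,
  oldest: (a, b) => a.mtimeMs - b.mtimeMs,
  smallest: (a, b) => a.size - b.size,
  largest: (a, b) => b.size - a.size,
};

function planGroup(files, { keep = 'newest', sizeThreshold = Infinity } = {}) {
  const ranker = KEEP_RANKERS[keep];
  if (!ranker) throw new Error(`Unknown keep policy: ${keep}`);
  const [kept, ...rest] = [...files].sort(ranker);
  return {
    keep: kept,
    // Files above the threshold are reported, never removed.
    remove: rest.filter((f) => f.size <= sizeThreshold),
    skipped: rest.filter((f) => f.size > sizeThreshold),
  };
}
```

Separating the plan from the file operations also gives a natural dry-run: print the plan and stop.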

analyzeDirectory

Analyze a single directory for duplicates.

Parameters:

  • directory (string, required): Path to directory
  • options (object, optional): Same as findDuplicates options

Returns:

  • fileCount (number): Total files in directory
  • totalSize (number): Total bytes in directory
  • duplicateSize (number): Bytes in duplicate files
  • duplicateRatio (number): Percentage of files that are duplicates

Use Cases

Digital Hoarder Cleanup

  • Find duplicate photos/videos
  • Identify wasted storage space
  • Remove old duplicates, keep newest
  • Clean up download folders

Document Management

  • Find duplicate PDFs, docs, reports
  • Keep latest version, archive old versions
  • Prevent version confusion
  • Reduce backup bloat

Project Cleanup

  • Find duplicate source files
  • Remove duplicate build artifacts
  • Clean up node_modules duplicates
  • Save storage on SSD/HDD

Backup Optimization

  • Find duplicate backup files
  • Remove redundant backups
  • Identify what's actually duplicated
  • Save space on backup drives

Configuration

Edit config.json:

{
  "detection": {
    "defaultMethod": "content",
    "sizeTolerancePercent": 0,
    "nameSimilarity": 0.7,
    "includeSubdirs": true
  },
  "removal": {
    "defaultAction": "delete",
    "defaultKeep": "newest",
    "archivePath": "./archive",
    "sizeThreshold": 10485760,
    "autoConfirm": false,
    "dryRunDefault": false
  },
  "exclude": {
    "patterns": [".git", "node_modules", ".vscode", ".idea"],
    "whitelist": ["important", "work", "projects"]
  }
}

Notes (standard JSON does not allow comments, so keep these out of the file):

  • sizeTolerancePercent: 0 means exact size match only
  • nameSimilarity: a value from 0 to 1; the README describes 0.7 as a 70% similarity threshold
  • sizeThreshold: 10485760 bytes (10 MB)

Detection Methods

Content-Based (Recommended)

  • Fast MD5 hashing
  • Detects exact duplicates regardless of filename
  • Works across renamed files
  • Perfect for documents, code, archives
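
A minimal sketch of content-based grouping, assuming a list of file paths is already available. `md5OfFile` and `groupByContent` are hypothetical names; the size pre-filter and the whole-file `readFileSync` are simplifications chosen for brevity, and a streaming hash would suit very large files better.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Hash the whole file in memory (fine for a sketch, costly for huge files).
function md5OfFile(filePath) {
  return crypto.createHash('md5').update(fs.readFileSync(filePath)).digest('hex');
}

// Group paths by size first, then hash only files that share a size.
function groupByContent(paths) {
  const bySize = new Map();
  for (const p of paths) {
    const size = fs.statSync(p).size;
    if (!bySize.has(size)) bySize.set(size, []);
    bySize.get(size).push(p);
  }
  const groups = [];
  for (const [size, sameSize] of bySize) {
    if (sameSize.length < 2) continue; // a unique size cannot be a duplicate
    const byHash = new Map();
    for (const p of sameSize) {
      const hash = md5OfFile(p);
      if (!byHash.has(hash)) byHash.set(hash, []);
      byHash.get(hash).push(p);
    }
    for (const [hash, files] of byHash) {
      if (files.length > 1) groups.push({ hash, size, files });
    }
  }
  return groups;
}
```

Because filenames never enter the comparison, renamed copies land in the same group as the original.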

Size-Based

  • Compares file sizes
  • Faster than content hashing
  • Good for media files where content hashing is slow
  • Can flag near matches when a size tolerance is set (similar but not exact), so equal size alone does not prove identical content

Name-Based

  • Compares filenames
  • Detects similar named files
  • Good for finding version duplicates (file_v1, file_v2)
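
One way to score filename similarity is a normalized edit distance. This is only a sketch of the idea: the source does not say which metric the skill uses, and `levenshtein` and `nameSimilarity` here are hypothetical helpers that compare base names without extensions.

```javascript
const path = require('path');

// Classic dynamic-programming edit distance, keeping one row at a time.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Returns a score from 0 (nothing in common) to 1 (identical base names).
function nameSimilarity(fileA, fileB) {
  const a = path.parse(fileA).name.toLowerCase();
  const b = path.parse(fileB).name.toLowerCase();
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

console.log(nameSimilarity('report_v1.pdf', 'report_v2.pdf').toFixed(2)); // 0.89
```

With this metric, a pair would count as a name-based match when the score reaches the configured threshold, for example 0.7.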

Examples

Find Duplicates in Documents

const result = await findDuplicates({
  directories: '~/Documents',
  options: {
    method: 'content',
    includeSubdirs: true
  }
});

console.log(`Found ${result.duplicateCount} duplicate sets`);
result.duplicates.slice(0, 5).forEach((set, i) => {
  console.log(`Set ${i+1}: ${set.files.length} files`);
  console.log(`  Total size: ${set.totalSize} bytes`);
});

Remove Duplicates, Keep Newest

const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    keep: 'newest',
    action: 'delete'
  }
});

console.log(`Removed ${result.filesRemoved} files`);
console.log(`Saved ${result.spaceSaved} bytes`);

Move to Archive Instead of Delete

const result = await removeDuplicates({
  directories: '~/Downloads',
  options: {
    keep: 'newest',
    action: 'move',
    archivePath: '~/Documents/Archive'
  }
});

console.log(`Archived ${result.filesRemoved} files`);
console.log(`Safe in: ~/Documents/Archive`);

Dry-Run Preview Changes

const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    dryRun: true  // Just show what would happen
  }
});

console.log('=== Dry Run Preview ===');
result.duplicates.forEach((set, i) => {
  console.log(`Would delete: ${set.toDelete.join(', ')}`);
});

Performance

Scanning Speed

  • Small directories (<1000 files): <1s
  • Medium directories (1000-10000 files): 1-5s
  • Large directories (10000+ files): 5-20s

Detection Accuracy

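  • Content-based: Finds exact duplicates; accidental MD5 collisions are extremely unlikely, though MD5 is not collision-resistant against deliberately crafted files
  • Size-based: Fast, but files of equal size can differ in content, so matches are candidates only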
  • Name-based: Detects naming patterns only

Memory Usage

  • Hash cache: ~1MB per 100,000 files
  • Batch processing: Processes 1000 files at a time
  • Peak memory: ~200MB for 1M files

Safety Features

Size Thresholding

Won't remove files larger than configurable threshold (default: 10MB). Prevents accidental deletion of important large files.

Archive Mode

Move files to archive directory instead of deleting. No data loss, full recoverability.
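
Archive mode can be sketched as a move that never overwrites an existing archived file. `moveToArchive` is a hypothetical helper and the " (n)" suffix scheme is an assumption; the skill's own naming may differ.

```javascript
const fs = require('fs');
const path = require('path');

// Move `filePath` into `archiveDir`, adding a numeric suffix on name clashes.
function moveToArchive(filePath, archiveDir) {
  fs.mkdirSync(archiveDir, { recursive: true });
  const { name, ext } = path.parse(filePath);
  let target = path.join(archiveDir, name + ext);
  for (let n = 1; fs.existsSync(target); n++) {
    target = path.join(archiveDir, `${name} (${n})${ext}`);
  }
  try {
    fs.renameSync(filePath, target);
  } catch (err) {
    if (err.code !== 'EXDEV') throw err;
    // rename cannot cross filesystems: copy, then delete the original.
    fs.copyFileSync(filePath, target);
    fs.unlinkSync(filePath);
  }
  return target;
}
```

Returning the final target path makes it easy to record where each file ended up.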

Action Logging

All deletions/moves are logged to file for recovery and audit.

Undo Functionality

Log file can be used to restore accidentally deleted files (limited undo window).
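
How logging and undo might fit together, as a rough sketch: each action is appended as one JSON line, and moves can be reversed from that log. The log format and the `logAction` and `undoMoves` helpers are assumptions, not the skill's documented format. Note that only moved files can be restored this way; a plain delete leaves nothing to move back.

```javascript
const fs = require('fs');
const path = require('path');

// Append one JSON object per line so the log stays easy to parse.
function logAction(logPath, entry) {
  fs.appendFileSync(logPath, JSON.stringify({ ...entry, at: new Date().toISOString() }) + '\n');
}

// Undo logged moves, newest first. Returns the paths that were restored.
function undoMoves(logPath) {
  const entries = fs.readFileSync(logPath, 'utf8')
    .split('\n')
    .filter(Boolean)
    .map((line) => JSON.parse(line));
  const restored = [];
  for (const entry of entries.reverse()) {
    // Skip non-moves, missing archive copies, and occupied original paths.
    if (entry.action !== 'move' || !fs.existsSync(entry.to) || fs.existsSync(entry.from)) continue;
    fs.mkdirSync(path.dirname(entry.from), { recursive: true });
    fs.renameSync(entry.to, entry.from);
    restored.push(entry.from);
  }
  return restored;
}
```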

Error Handling

Permission Errors

  • Clear error message
  • Suggest running with sudo
  • Skip files that can't be accessed

File Lock Errors

  • Detect locked files
  • Skip and report
  • Suggest closing applications using files

Space Errors

  • Check available disk space before deletion
  • Warn if space is critically low
  • Prevent disk-full scenarios

Troubleshooting

Not Finding Expected Duplicates

  • Check detection method (content vs size vs name)
  • Verify exclude patterns aren't too broad
  • Check if files are in whitelisted directories
  • Make sure includeSubdirs is true so nested folders are scanned

Deletion Not Working

  • Check write permissions on directories
  • Check action and autoConfirm: with autoConfirm: false, each removal waits for confirmation (and dryRun: true removes nothing)
  • Check size threshold isn't blocking all deletions
  • Check file locks (is another program using files?)

Slow Scanning

  • Set includeSubdirs: false, or scan fewer directories
  • Use size-based detection (faster)
  • Exclude large directories (node_modules, .git)
  • Process directories individually instead of batch

Tips

Best Results

  • Use content-based detection for documents (100% accurate)
  • Run dry-run first to preview changes
  • Archive instead of delete for important files
  • Check the logs if anything unexpected was deleted

Performance Optimization

  • Process frequently used directories first
  • Use size threshold to skip large media files
  • Exclude hidden directories from scan
  • Process directories in parallel when possible

Space Management

  • Regular duplicate cleanup prevents storage bloat
  • Delete temp directories regularly
  • Clear download folders of installers
  • Empty trash before large scans

Roadmap

  • Duplicate detection by image similarity
  • Near-duplicate detection (similar but not exact)
  • Duplicate detection across network drives
  • Cloud storage integration (S3, Google Drive)
  • Automatic scheduling of scans
  • Heuristic duplicate detection (ML-based)
  • Recover deleted files from backup
  • Duplicate detection by file content similarity (not just hash)

License

MIT


Find duplicates. Save space. Keep your system clean. 🔮

README.md

File-Deduplicator

Find and remove duplicate files intelligently. Save storage space, keep your system clean.

Quick Start

# Install
clawhub install file-deduplicator

# Find duplicates in directory
cd ~/.openclaw/skills/file-deduplicator
node index.js findDuplicates '{"directories":["./documents","./downloads"],"options":{"method":"content"}}'

# Remove duplicates automatically
node index.js removeDuplicates '{"directories":["./documents"],"options":{"keep":"newest","action":"delete"}}'
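
The commands above pass a tool name followed by a JSON string. As a rough illustration of that calling convention (not the actual index.js, whose internals are not shown here), a hypothetical `parseCliArgs` helper could read the two arguments like this:

```javascript
// Hypothetical parser for: node index.js <toolName> '<json params>'
function parseCliArgs(argv) {
  // argv[0] is the node binary and argv[1] the script path.
  const [toolName, rawParams = '{}'] = argv.slice(2);
  if (!toolName) throw new Error("Usage: node index.js <toolName> '<json params>'");
  let params;
  try {
    params = JSON.parse(rawParams);
  } catch (err) {
    throw new Error(`Parameters must be valid JSON: ${err.message}`);
  }
  return { toolName, params };
}

const parsed = parseCliArgs([
  'node', 'index.js', 'findDuplicates',
  '{"directories":["./documents"],"options":{"method":"content"}}',
]);
console.log(parsed.toolName, parsed.params.options.method); // findDuplicates content
```

Single quotes around the JSON keep the shell from interpreting the double quotes inside it.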

Usage Examples

Find All Types of Duplicates

const result = await findDuplicates({
  directories: ['~/Documents', '~/Downloads'],
  options: {
    method: 'all'  // Check content, size, and name
  }
});

console.log(`Found ${result.duplicates.length} duplicate groups`);
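// formatBytes is assumed to be a helper you provide; it is not defined in this snippet.
// SKILL.md documents this field as spaceWasted rather than totalWasted.
console.log(`Space wasted: ${formatBytes(result.totalWasted)}`);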
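
The snippet above calls `formatBytes`, which is not defined anywhere in this README. A minimal sketch of such a helper, assuming binary units (1 KB = 1024 bytes, matching the 10485760-byte "10MB" threshold used elsewhere), could look like this:

```javascript
// Hypothetical helper: turn a byte count into a short human-readable string.
function formatBytes(bytes) {
  const units = ['B', 'KB', 'MB', 'GB', 'TB'];
  let value = bytes;
  let unit = 0;
  while (value >= 1024 && unit < units.length - 1) {
    value /= 1024;
    unit++;
  }
  // Whole bytes are shown as-is; larger units get one decimal place.
  return `${unit === 0 ? value : value.toFixed(1)} ${units[unit]}`;
}

console.log(formatBytes(10485760)); // 10.0 MB
```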

Content-Based Duplicates (Most Accurate)

const result = await findDuplicates({
  directories: ['~/Documents'],
  options: {
    method: 'content'  // MD5 hashing
  }
});

Size-Based Duplicates (Fastest)

const result = await findDuplicates({
  directories: ['~/Pictures', '~/Videos'],
  options: {
    method: 'size',
    minSize: 1048576,  // 1MB minimum
    maxSize: 104857600  // 100MB maximum
  }
});

Name-Based Duplicates (Find Renamed Copies)

const result = await findDuplicates({
  directories: ['~/Documents'],
  options: {
    method: 'name',
    nameSimilarity: 0.7  // 70% similarity threshold
  }
});

Tool Functions

findDuplicates

Find duplicate files across directories.

Parameters:

  • directories (array, required): Directory paths to scan
  • options (object, optional):
    • method (string): 'content' | 'size' | 'name' | 'all' (default: 'all')
    • includeSubdirs (boolean): Scan recursively (default: true)
    • minSize (number): Minimum size in bytes (default: 0)
    • maxSize (number): Maximum size in bytes (default: 0, which presumably means no upper limit)
    • excludePatterns (array): Glob patterns to exclude (default: ['.git', 'node_modules'])
    • nameSimilarity (number): 0-1 for name-based (default: 0.7)
    • whitelist (array): Directories to never scan

Returns:

  • duplicates (array): Array of duplicate groups
  • totalFiles (number): Total files scanned
  • method (string): Detection method used

removeDuplicates

Remove or move duplicate files.

Parameters:

  • directories (array, required): Same as findDuplicates
  • options (object, optional):
    • keep (string): 'newest' | 'oldest' | 'smallest' | 'largest' (default: 'newest')
    • action (string): 'delete' | 'move' | 'archive' (default: 'delete')
    • archivePath (string): Where to move files when action='move'
    • dryRun (boolean): Preview without actual action
    • sizeThreshold (number): Don't remove files larger than this (default: 10MB)

Returns:

  • filesRemoved (number): Number of files removed/moved
  • spaceSaved (number): Bytes saved
  • errors (array): Error details
  • logPath (string): Path to action log

analyzeDirectory

Analyze a single directory for duplicate statistics.

Parameters:

  • directory (string, required): Path to directory
  • options (object, optional): Same as findDuplicates options

Returns:

  • fileCount (number): Total files scanned
  • duplicateCount (number): Number of duplicate groups
  • duplicateSize (number): Total bytes in duplicates
  • totalSize (number): Total bytes in directory
  • duplicateRatio (number): Percentage of files that are duplicates
  • scanDuration (number): Time taken to scan (ms)

Configuration

Edit config.json to customize:

{
  "detection": {
    "defaultMethod": "content",
    "sizeTolerancePercent": 0,
    "nameSimilarity": 0.7,
    "includeSubdirs": true
  },
  "removal": {
    "defaultAction": "delete",
    "defaultKeep": "newest",
    "archivePath": "./archive",
    "sizeThreshold": 10485760,
    "autoConfirm": false
  },
  "exclude": {
    "patterns": [".git", "node_modules", ".vscode", ".idea"],
    "whitelist": ["important", "work", "projects"]
  }
}

Performance

Scanning Speed

  • Small directories (<1000 files): <1s
  • Medium directories (1000-10000 files): 1-5s
  • Large directories (10000+ files): 5-20s

Detection Accuracy

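  • Content-based: Finds exact duplicates; accidental MD5 collisions are extremely unlikely, though MD5 is not collision-resistant against deliberately crafted files
  • Size-based: Fast, but files of equal size can differ in content, so matches are candidates only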
  • Name-based: Detects naming patterns only

Memory Usage

  • Hash cache: ~1MB per 100,000 files
  • Batch processing: Processes 1000 files at a time
  • Peak memory: ~100MB for 1M files

Safety Features

Size Thresholding

Won't remove files larger than configurable threshold (default: 10MB). Prevents accidental deletion of important large files.

Archive Mode

Move files to archive directory instead of deleting. No data loss, full recoverability.

Action Logging

All deletions/moves are logged to file for recovery and audit.

Tips

Best Results

  • Use content-based detection for documents (100% accurate)
  • Use size-based for media files (faster)
  • Run dry-run first to preview changes
  • Archive instead of delete for important files

Performance Optimization

  • Process frequently used directories first
  • Use size threshold to skip large media files
  • Exclude hidden directories from scan
  • Process directories in parallel when possible

Troubleshooting

Not Finding Expected Duplicates

  • Check detection method (content vs size vs name)
  • Verify exclude patterns aren't too broad
  • Check if files are in whitelisted directories

Removal Not Working

  • Check write permissions on directories
  • Check action and autoConfirm: with autoConfirm: false, each removal waits for confirmation (and dryRun: true removes nothing)
  • Check size threshold isn't blocking all deletions
  • Check file locks (is another program using files?)

Slow Scanning

  • Set includeSubdirs: false, or scan fewer directories
  • Use size-based detection (faster)
  • Exclude large directories (node_modules, .git)
  • Process directories individually instead of batch

License

MIT


Find duplicates. Save space. Keep your system clean. 🔮

Permissions & Security

Security level L1: Low-risk skills with minimal permissions. Review inputs and outputs before running in production.

Requirements

  • OpenClaw CLI installed and configured.
  • Language: Markdown
  • License: MIT
  • Topics:

FAQ

How do I install file-deduplicator?

Run openclaw add @michael-laffin/file-deduplicator in your terminal. This installs file-deduplicator into your OpenClaw Skills catalog.

Does this skill run locally or in the cloud?

OpenClaw Skills execute locally by default. Review the SKILL.md and permissions before running any skill.

Where can I verify the source code?

The source repository is available at https://github.com/openclaw/skills/tree/main/skills/michael-laffin/file-deduplicator. Review commits and README documentation before installing.