file-deduplicator – OpenClaw Skill

Name: file-deduplicator
Author: michael-laffin

file-deduplicator is an OpenClaw Skills integration for coding workflows. Find and remove duplicate files intelligently. Save storage space, keep your system clean. Perfect for digital hoarders and document management.

8.4k stars · 9.0k forks · Security L1

Updated Feb 7, 2026 · Created Feb 7, 2026 · coding

Skill Snapshot

name	file-deduplicator
description	Find and remove duplicate files intelligently. Save storage space, keep your system clean. Perfect for digital hoarders and document management. OpenClaw Skills integration.
owner	michael-laffin
repository	michael-laffin/file-deduplicator
language	Markdown
license	MIT
topics
security	L1
install	openclaw add @michael-laffin/file-deduplicator
last updated	Feb 7, 2026

Maintainer

michael-laffin

Maintains file-deduplicator in the OpenClaw Skills directory.

File Explorer

7 files

_meta.json

294 B

config.json

726 B

index.js

17.5 KB

package.json

407 B

README.md

6.6 KB

SKILL.md

11.0 KB

test.js

2.5 KB

SKILL.md

name: file-deduplicator
description: Find and remove duplicate files intelligently. Save storage space, keep your system clean. Perfect for digital hoarders and document management.
metadata: { "openclaw": { "version": "1.0.0", "author": "Vernox", "license": "MIT", "tags": ["deduplication", "storage", "cleanup", "file-management", "duplicate", "disk-space"], "category": "tools" } }

File-Deduplicator - Find and Remove Duplicates

Vernox Utility Skill - Clean up your digital hoard.

Overview

File-Deduplicator finds and removes duplicate files. It uses content hashing to identify identical files across directories, then offers several ways to remove the duplicates safely.

Features

✅ Duplicate Detection

Content-based hashing (MD5) for fast comparison
Size-based detection (exact match, near match)
Name-based detection (similar filenames)
Directory scanning (recursive)
Exclude patterns (.git, node_modules, etc.)

✅ Removal Options

Auto-delete duplicates (keep newest/oldest)
Interactive review before deletion
Move to archive instead of delete
Preserve permissions and metadata
Dry-run mode (preview changes)

✅ Analysis Tools

Duplicate count summary
Space savings estimation
Largest duplicate files
Most common duplicate patterns
Detailed report generation

✅ Safety Features

Confirmation prompts before deletion
Backup to archive folder
Size threshold (don't remove huge files by mistake)
Whitelist important directories
Undo functionality (log for recovery)

Installation

clawhub install file-deduplicator

Quick Start

Find Duplicates in Directory

const result = await findDuplicates({
  directories: ['./documents', './downloads', './projects'],
  options: {
    method: 'content',  // content-based comparison
    includeSubdirs: true
  }
});

console.log(`Found ${result.duplicateCount} duplicate groups`);
console.log(`Potential space savings: ${result.spaceSaved}`);

Remove Duplicates Automatically

const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',  // keep the newest copy, remove the older ones
    action: 'delete',  // or 'move' to archive
    autoConfirm: false  // show confirmation for each
  }
});

console.log(`Removed ${result.filesRemoved} duplicates`);
console.log(`Space saved: ${result.spaceSaved}`);

Dry-Run Preview

const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',
    action: 'delete',
    dryRun: true  // Preview without actual deletion
  }
});

console.log('Would remove:');
result.duplicates.forEach((dup, i) => {
  console.log(`${i+1}. ${dup.file}`);
});

Tool Functions

`findDuplicates`

Find duplicate files across directories.

Parameters:

directories (array|string, required): Directory paths to scan
options (object, optional):
- method (string): 'content' | 'size' | 'name' - comparison method
- includeSubdirs (boolean): Scan recursively (default: true)
- minSize (number): Minimum size in bytes (default: 0)
- maxSize (number): Maximum size in bytes (default: 0)
- excludePatterns (array): Glob patterns to exclude (default: ['.git', 'node_modules'])
- whitelist (array): Directories to never scan (default: [])
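
To make the scanning options concrete, here is a minimal sketch of a recursive directory walk that honours `includeSubdirs` and an exclude list. The function name `listFiles` is hypothetical, and matching excluded entries by exact name (rather than by glob pattern, as the option describes) is a simplifying assumption; the skill's own `index.js` may work differently.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: list regular files under `root`.
// Assumption: exclude entries are matched by exact name, not as globs.
function listFiles(root, { includeSubdirs = true, exclude = ['.git', 'node_modules'] } = {}) {
  const found = [];
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    if (exclude.includes(entry.name)) continue;
    const fullPath = path.join(root, entry.name);
    if (entry.isDirectory()) {
      if (!includeSubdirs) continue;
      for (const nested of listFiles(fullPath, { includeSubdirs, exclude })) {
        found.push(nested);
      }
    } else if (entry.isFile()) {
      // Symlinks and special files are ignored in this sketch.
      found.push(fullPath);
    }
  }
  return found;
}
```

Calling `listFiles('./documents', { includeSubdirs: false })` would return only the files directly inside `./documents`.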

Returns:

duplicates (array): Array of duplicate groups
duplicateCount (number): Number of duplicate groups found
totalFiles (number): Total files scanned
scanDuration (number): Time taken to scan (ms)
spaceWasted (number): Total bytes wasted by duplicates
spaceSaved (number): Potential savings if duplicates removed
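
As an illustration of how the summary fields could relate to the duplicate groups, the sketch below counts every copy beyond the first in each group as wasted space. The `summarize` name and the `{ path, size }` record shape are assumptions made for this example, not the skill's documented structure.

```javascript
// Hypothetical helper. `groups` is an array of arrays of { path, size },
// where every file in a group is assumed to have identical content.
function summarize(groups) {
  let spaceWasted = 0;
  for (const group of groups) {
    // One copy has to stay; each additional copy is redundant.
    spaceWasted += (group.length - 1) * group[0].size;
  }
  return { duplicateCount: groups.length, spaceWasted };
}
```

Under this reading, three copies of a 100-byte file plus two copies of a 50-byte file give two groups and 250 wasted bytes.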

`removeDuplicates`

Remove duplicate files based on findings.

Parameters:

directories (array|string, required): Same as findDuplicates
options (object, optional):
- keep (string): 'newest' | 'oldest' | 'smallest' | 'largest' - which to keep
- action (string): 'delete' | 'move' | 'archive'
- archivePath (string): Where to move files when action='move'
- dryRun (boolean): Preview without actual action
- autoConfirm (boolean): Auto-confirm deletions
- sizeThreshold (number): Don't remove files larger than this

Returns:

filesRemoved (number): Number of files removed/moved
spaceSaved (number): Bytes saved
groupsProcessed (number): Number of duplicate groups handled
logPath (string): Path to action log
errors (array): Any errors encountered
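
The `keep` option decides which file in each duplicate group survives. A minimal sketch of that selection follows; `planGroup` and the `{ path, size, mtimeMs }` record shape are hypothetical, and using modification time for 'newest' and 'oldest' is an assumption about how age is measured.

```javascript
// Each comparator sorts the file to KEEP to the front of the group.
const KEEP_ORDER = {
  newest: (a, b) => b.mtimeMs - a.mtimeMs,
  oldest: (a, b) => a.mtimeMs - b.mtimeMs,
  smallest: (a, b) => a.size - b.size,
  largest: (a, b) => b.size - a.size,
};

// Hypothetical helper: split one duplicate group into the file to keep
// and the files that would be removed or archived.
function planGroup(files, keep = 'newest') {
  if (!Object.prototype.hasOwnProperty.call(KEEP_ORDER, keep)) {
    throw new Error(`Unknown keep strategy: ${keep}`);
  }
  const [kept, ...toRemove] = [...files].sort(KEEP_ORDER[keep]);
  return { kept, toRemove };
}
```

Separating the plan from the action like this also makes a dry run straightforward: the plan can be printed without touching the filesystem.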

`analyzeDirectory`

Analyze a single directory for duplicates.

Parameters:

directory (string, required): Path to directory
options (object, optional): Same as findDuplicates options

Returns:

fileCount (number): Total files in directory
totalSize (number): Total bytes in directory
duplicateSize (number): Bytes in duplicate files
duplicateRatio (number): Percentage of files that are duplicates

Use Cases

Digital Hoarder Cleanup

Find duplicate photos/videos
Identify wasted storage space
Remove old duplicates, keep newest
Clean up download folders

Document Management

Find duplicate PDFs, docs, reports
Keep latest version, archive old versions
Prevent version confusion
Reduce backup bloat

Project Cleanup

Find duplicate source files
Remove duplicate build artifacts
Clean up node_modules duplicates
Save storage on SSD/HDD

Backup Optimization

Find duplicate backup files
Remove redundant backups
Identify what's actually duplicated
Save space on backup drives

Configuration

Edit `config.json`. JSON does not allow comments, so notes on individual values follow the block:

{
  "detection": {
    "defaultMethod": "content",
    "sizeTolerancePercent": 0,
    "nameSimilarity": 0.7,
    "includeSubdirs": true
  },
  "removal": {
    "defaultAction": "delete",
    "defaultKeep": "newest",
    "archivePath": "./archive",
    "sizeThreshold": 10485760,
    "autoConfirm": false,
    "dryRunDefault": false
  },
  "exclude": {
    "patterns": [".git", "node_modules", ".vscode", ".idea"],
    "whitelist": ["important", "work", "projects"]
  }
}

sizeTolerancePercent: 0 matches exact sizes only
nameSimilarity: threshold between 0 and 1; higher values require closer filename matches
sizeThreshold: 10485760 bytes (10 MB)
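
To show how file-level defaults and per-call options might combine, here is a small sketch that lets explicit options override the `removal` section of the configuration. `resolveRemovalOptions` is a hypothetical name, and the precedence rule (call options win over config values) is an assumption rather than documented behaviour.

```javascript
// Values copied from the "removal" section of config.json.
const removalConfig = {
  defaultAction: 'delete',
  defaultKeep: 'newest',
  archivePath: './archive',
  sizeThreshold: 10485760,
  autoConfirm: false,
  dryRunDefault: false,
};

// Hypothetical helper: per-call options take precedence over config defaults.
function resolveRemovalOptions(config, options = {}) {
  return {
    action: options.action ?? config.defaultAction,
    keep: options.keep ?? config.defaultKeep,
    archivePath: options.archivePath ?? config.archivePath,
    sizeThreshold: options.sizeThreshold ?? config.sizeThreshold,
    autoConfirm: options.autoConfirm ?? config.autoConfirm,
    dryRun: options.dryRun ?? config.dryRunDefault,
  };
}
```

Using `??` instead of `||` keeps explicit falsy values such as `autoConfirm: false` or `sizeThreshold: 0` from being replaced by the defaults.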

Methods

Content-Based (Recommended)

Fast MD5 hashing
Detects exact duplicates regardless of filename
Works across renamed files
Perfect for documents, code, archives
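
The idea behind content-based detection can be sketched with Node's built-in `crypto` module: hash each file's bytes and group paths that share a hash. `md5OfFile` and `groupByContent` are hypothetical names, and reading each file fully into memory is a simplification; this is not taken from the skill's `index.js`.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Hash a file's bytes with MD5, the algorithm this skill names.
function md5OfFile(filePath) {
  return crypto.createHash('md5').update(fs.readFileSync(filePath)).digest('hex');
}

// Group paths by content hash; only groups of two or more are duplicates.
function groupByContent(filePaths) {
  const byHash = new Map();
  for (const filePath of filePaths) {
    const hash = md5OfFile(filePath);
    if (!byHash.has(hash)) byHash.set(hash, []);
    byHash.get(hash).push(filePath);
  }
  return [...byHash.values()].filter((group) => group.length > 1);
}
```

Because the hash depends only on the bytes, two copies with different filenames land in the same group.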

Size-Based

Compares file sizes
Faster than content hashing
Good for media files where content hashing is slow
Surfaces near-duplicate candidates (similar but not exact); equal or similar size alone does not prove the contents match
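
A size comparison with a tolerance could look like the sketch below. `groupBySize` is a hypothetical helper, and measuring the tolerance as a percentage of the smallest file in each group is one possible interpretation of `sizeTolerancePercent`, not a documented rule.

```javascript
// Hypothetical helper. `files` is an array of { path, size }.
// Files are grouped when their size is within `tolerancePercent`
// of the smallest file in the current group.
function groupBySize(files, tolerancePercent = 0) {
  const sorted = [...files].sort((a, b) => a.size - b.size);
  const groups = [];
  let current = [];
  for (const file of sorted) {
    const base = current.length ? current[0].size : null;
    const withinTolerance =
      base !== null && file.size - base <= (base * tolerancePercent) / 100;
    if (withinTolerance) {
      current.push(file);
    } else {
      if (current.length > 1) groups.push(current);
      current = [file];
    }
  }
  if (current.length > 1) groups.push(current);
  return groups;
}
```

The groups returned here are only candidates; hashing the members of each group is what confirms that they are true duplicates.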

Name-Based

Compares filenames
Detects similar named files
Good for finding version duplicates (file_v1, file_v2)
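
The skill does not say which metric it uses to compare filenames. As one plausible illustration, the sketch below scores two names with a normalized Levenshtein edit distance, where 1 means identical and 0 means nothing in common. Both function names are hypothetical.

```javascript
// Classic dynamic-programming Levenshtein edit distance.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const row = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(prev[j] + 1, row[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = row;
  }
  return prev[b.length];
}

// Similarity in [0, 1]: 1 for identical names, 0 for completely different ones.
function nameSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - editDistance(a, b) / longest;
}
```

With this metric, `file_v1.txt` and `file_v2.txt` differ by one character out of eleven, so they score about 0.91 and would pass a 0.7 threshold.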

Examples

Find Duplicates in Documents

const result = await findDuplicates({
  directories: '~/Documents',
  options: {
    method: 'content',
    includeSubdirs: true
  }
});

console.log(`Found ${result.duplicateCount} duplicate sets`);
result.duplicates.slice(0, 5).forEach((set, i) => {
  console.log(`Set ${i+1}: ${set.files.length} files`);
  console.log(`  Total size: ${set.totalSize} bytes`);
});
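
Node's `fs` functions do not expand `~` on their own, and the documentation does not say whether the skill resolves it before scanning. If it does not, paths such as `'~/Documents'` can be expanded first with a small helper like this hypothetical `expandHome`:

```javascript
const os = require('os');
const path = require('path');

// Hypothetical helper: replace a leading "~" with the current user's home directory.
function expandHome(p) {
  if (p === '~') return os.homedir();
  if (p.startsWith('~/')) return path.join(os.homedir(), p.slice(2));
  return p;
}
```

The result can then be passed as `directories: expandHome('~/Documents')`.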

Remove Duplicates, Keep Newest

const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    keep: 'newest',
    action: 'delete'
  }
});

console.log(`Removed ${result.filesRemoved} files`);
console.log(`Saved ${result.spaceSaved} bytes`);

Move to Archive Instead of Delete

const result = await removeDuplicates({
  directories: '~/Downloads',
  options: {
    keep: 'newest',
    action: 'move',
    archivePath: '~/Documents/Archive'
  }
});

console.log(`Archived ${result.filesRemoved} files`);
console.log(`Safe in: ~/Documents/Archive`);

Dry-Run Preview Changes

const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    dryRun: true  // Just show what would happen
  }
});

console.log('=== Dry Run Preview ===');
result.duplicates.forEach((set, i) => {
  console.log(`Would delete: ${set.toDelete.join(', ')}`);
});

Performance

Scanning Speed

Small directories (<1000 files): <1s
Medium directories (1000-10000 files): 1-5s
Large directories (10000+ files): 5-20s

Detection Accuracy

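Content-based: effectively 100% for exact duplicates (an accidental MD5 collision is extremely unlikely)
Size-based: Fast, but equal size does not prove equal content, so false positives are possible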
Name-based: Detects naming patterns only

Memory Usage

Hash cache: ~1MB per 100,000 files
Batch processing: Processes 1000 files at a time
Peak memory: ~200MB for 1M files
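
One general technique for keeping memory flat while hashing large files, separate from the batching figures above, is to read each file in fixed-size chunks instead of loading it whole. The sketch below uses only Node's built-in `fs` and `crypto` modules; `md5Chunked` and the 64 KB chunk size are illustrative choices, not details of this skill.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Hash a file chunk by chunk so memory use does not grow with file size.
function md5Chunked(filePath, chunkSize = 64 * 1024) {
  const hash = crypto.createHash('md5');
  const buffer = Buffer.alloc(chunkSize);
  const fd = fs.openSync(filePath, 'r');
  try {
    let bytesRead;
    // position = null reads from the current file position and advances it.
    while ((bytesRead = fs.readSync(fd, buffer, 0, chunkSize, null)) > 0) {
      hash.update(buffer.subarray(0, bytesRead));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest('hex');
}
```

The digest is identical to hashing the whole file in one call; only the peak memory differs.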

Safety Features

Size Thresholding

Won't remove files larger than configurable threshold (default: 10MB). Prevents accidental deletion of important large files.
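
The threshold check can be pictured as a filter applied to removal candidates before any action is taken. `applySizeThreshold` and the `{ path, size }` record shape are hypothetical; the strict "larger than" comparison follows the wording above.

```javascript
const DEFAULT_SIZE_THRESHOLD = 10 * 1024 * 1024; // 10485760 bytes

// Hypothetical helper: split candidates into files that may be removed
// and files that are skipped because they exceed the threshold.
function applySizeThreshold(candidates, sizeThreshold = DEFAULT_SIZE_THRESHOLD) {
  const removable = [];
  const skipped = [];
  for (const file of candidates) {
    (file.size > sizeThreshold ? skipped : removable).push(file);
  }
  return { removable, skipped };
}
```

Returning the skipped files alongside the removable ones lets a report explain why a known duplicate was left in place.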

Archive Mode

Move files to archive directory instead of deleting. No data loss, full recoverability.
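
A move-to-archive step has two practical pitfalls: duplicates usually share a filename, and a plain rename fails across filesystems. The sketch below handles both. `moveToArchive` and the `name.N.ext` collision scheme are assumptions for illustration, not the skill's actual behaviour.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: move `filePath` into `archiveDir` without
// overwriting a file that was archived earlier under the same name.
function moveToArchive(filePath, archiveDir) {
  fs.mkdirSync(archiveDir, { recursive: true });
  const { name, ext } = path.parse(filePath);
  let target = path.join(archiveDir, name + ext);
  for (let n = 1; fs.existsSync(target); n++) {
    target = path.join(archiveDir, `${name}.${n}${ext}`);
  }
  try {
    fs.renameSync(filePath, target);
  } catch (err) {
    if (err.code !== 'EXDEV') throw err;
    // rename cannot cross filesystems: copy first, then remove the original.
    fs.copyFileSync(filePath, target);
    fs.unlinkSync(filePath);
  }
  return target;
}
```

Returning the final target path makes it possible to record exactly where each file went.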

Action Logging

All deletions/moves are logged to file for recovery and audit.

Undo Functionality

Log file can be used to restore accidentally deleted files (limited undo window).
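
The log format is not documented here, so the following is only a sketch of how an action log could support recovery. It assumes one JSON object per line with hypothetical `action`, `from`, and `to` fields, and it can only undo moves, since a log entry alone cannot bring back the bytes of a deleted file.

```javascript
const fs = require('fs');

// Append one JSON object per line so a partial run still leaves a readable log.
function logAction(logPath, entry) {
  const record = { ...entry, at: new Date().toISOString() };
  fs.appendFileSync(logPath, JSON.stringify(record) + '\n');
}

function readLog(logPath) {
  return fs
    .readFileSync(logPath, 'utf8')
    .split('\n')
    .filter(Boolean)
    .map((line) => JSON.parse(line));
}

// Walk the log backwards and move archived files back to where they came from.
// Assumes archive and original location are on the same filesystem.
function undoMoves(logPath) {
  const restored = [];
  for (const entry of readLog(logPath).reverse()) {
    const canRestore =
      entry.action === 'move' && fs.existsSync(entry.to) && !fs.existsSync(entry.from);
    if (canRestore) {
      fs.renameSync(entry.to, entry.from);
      restored.push(entry.from);
    }
  }
  return restored;
}
```

The existence checks keep the undo from overwriting a file that has since been recreated at the original path.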

Error Handling

Permission Errors

Clear error message
Suggest running with sudo
Skip files that can't be accessed

File Lock Errors

Detect locked files
Skip and report
Suggest closing applications using files

Space Errors

Check available disk space before deletion
Warn if space is critically low
Prevent disk-full scenarios
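
The skip-and-report behaviour described for permission and lock errors can be sketched as a loop that records known error codes and continues, while letting unexpected errors surface. `statAll` and the exact set of skippable codes are assumptions; the codes themselves (`EACCES`, `EPERM`, `EBUSY`, `ENOENT`) are standard Node.js system error codes.

```javascript
const fs = require('fs');

// Error codes treated as "skip this file and keep going".
const SKIPPABLE = new Set(['EACCES', 'EPERM', 'EBUSY', 'ENOENT']);

// Hypothetical helper: stat every path, collecting failures instead of aborting.
function statAll(filePaths) {
  const stats = [];
  const errors = [];
  for (const filePath of filePaths) {
    try {
      stats.push({ path: filePath, size: fs.statSync(filePath).size });
    } catch (err) {
      if (!SKIPPABLE.has(err.code)) throw err;
      errors.push({ path: filePath, code: err.code, message: err.message });
    }
  }
  return { stats, errors };
}
```

The collected `errors` array has the same role as the `errors` field that `removeDuplicates` returns.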

Troubleshooting

Not Finding Expected Duplicates

Check detection method (content vs size vs name)
Verify exclude patterns aren't too broad
Check if files are in whitelisted directories
Make sure includeSubdirs is true so nested directories are scanned

Deletion Not Working

Check write permissions on directories
Check autoConfirm and dryRun: with autoConfirm: false each deletion waits for confirmation, and with dryRun: true nothing is removed
Check size threshold isn't blocking all deletions
Check file locks (is another program using files?)

Slow Scanning

Narrow the scan: set includeSubdirs: false or pass fewer directories
Use size-based detection (faster)
Exclude large directories (node_modules, .git)
Process directories individually instead of batch

Tips

Best Results

Use content-based detection for documents (100% accurate)
Run dry-run first to preview changes
Archive instead of delete for important files
Check the logs if anything unexpected was deleted

Performance Optimization

Process frequently used directories first
Use maxSize to skip large media files during scans
Exclude hidden directories from scan
Process directories in parallel when possible

Space Management

Regular duplicate cleanup prevents storage bloat
Delete temp directories regularly
Clear download folders of installers
Empty trash before large scans

Roadmap

Duplicate detection by image similarity
Near-duplicate detection (similar but not exact)
Duplicate detection across network drives
Cloud storage integration (S3, Google Drive)
Automatic scheduling of scans
Heuristic duplicate detection (ML-based)
Recover deleted files from backup
Duplicate detection by file content similarity (not just hash)

License

MIT

Find duplicates. Save space. Keep your system clean. 🔮

README.md

File-Deduplicator

Find and remove duplicate files intelligently. Save storage space, keep your system clean.

Quick Start

# Install
clawhub install file-deduplicator

# Find duplicates in directory
cd ~/.openclaw/skills/file-deduplicator
node index.js findDuplicates '{"directories":["./documents","./downloads"],"options":{"method":"content"}}'

# Remove duplicates automatically
node index.js removeDuplicates '{"directories":["./documents"],"options":{"keep":"newest","action":"delete"}}'
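
The commands above pass a function name followed by a JSON string. How `index.js` actually parses them is not shown, so the following is only a hypothetical sketch of such a dispatcher; `parseCli` and the `tools` lookup object are invented for illustration.

```javascript
// Hypothetical dispatcher: turn [functionName, jsonString] into a callable
// and its parameters. `tools` maps function names to implementations.
function parseCli(argv, tools) {
  const [name, json = '{}'] = argv;
  if (!Object.prototype.hasOwnProperty.call(tools, name)) {
    throw new Error(`Unknown function: ${name}`);
  }
  let params;
  try {
    params = JSON.parse(json);
  } catch (err) {
    throw new Error(`Invalid JSON parameters: ${err.message}`);
  }
  return { fn: tools[name], params };
}
```

In a script this would be called as `parseCli(process.argv.slice(2), tools)`. Wrapping the JSON in single quotes on the command line, as in the examples above, keeps the shell from interpreting the double quotes.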

Usage Examples

Find All Types of Duplicates

const result = await findDuplicates({
  directories: ['~/Documents', '~/Downloads'],
  options: {
    method: 'all'  // Check content, size, and name
  }
});

console.log(`Found ${result.duplicates.length} duplicate groups`);
console.log(`Space wasted: ${result.spaceWasted} bytes`);

Content-Based Duplicates (Most Accurate)

const result = await findDuplicates({
  directories: ['~/Documents'],
  options: {
    method: 'content'  // MD5 hashing
  }
});

Size-Based Duplicates (Fastest)

const result = await findDuplicates({
  directories: ['~/Pictures', '~/Videos'],
  options: {
    method: 'size',
    minSize: 1048576,  // 1MB minimum
    maxSize: 104857600  // 100MB maximum
  }
});

Name-Based Duplicates (Find Similarly Named Copies)

const result = await findDuplicates({
  directories: ['~/Documents'],
  options: {
    method: 'name',
    nameSimilarity: 0.7  // 70% similarity threshold
  }
});

Tool Functions

`findDuplicates`

Find duplicate files across directories.

Parameters:

directories (array, required): Directory paths to scan
options (object, optional):
- method (string): 'content' | 'size' | 'name' | 'all' (default: 'all')
- includeSubdirs (boolean): Scan recursively (default: true)
- minSize (number): Minimum size in bytes (default: 0)
- maxSize (number): Maximum size in bytes (default: 0)
- excludePatterns (array): Glob patterns to exclude (default: ['.git', 'node_modules'])
- nameSimilarity (number): 0-1 for name-based (default: 0.7)
- whitelist (array): Directories to never scan

Returns:

duplicates (array): Array of duplicate groups
totalFiles (number): Total files scanned
method (string): Detection method used

`removeDuplicates`

Remove or move duplicate files.

Parameters:

directories (array, required): Same as findDuplicates
options (object, optional):
- keep (string): 'newest' | 'oldest' | 'smallest' | 'largest' (default: 'newest')
- action (string): 'delete' | 'move' | 'archive' (default: 'delete')
- archivePath (string): Where to move files when action='move'
- dryRun (boolean): Preview without actual action
- sizeThreshold (number): Don't remove files larger than this (default: 10MB)

Returns:

filesRemoved (number): Number of files removed/moved
spaceSaved (number): Bytes saved
errors (array): Error details
logPath (string): Path to action log

`analyzeDirectory`

Analyze a single directory for duplicate statistics.

Parameters:

directory (string, required): Path to directory
options (object, optional): Same as findDuplicates options

Returns:

fileCount (number): Total files scanned
duplicateCount (number): Number of duplicate groups
duplicateSize (number): Total bytes in duplicates
totalSize (number): Total bytes in directory
duplicateRatio (number): Percentage of files that are duplicates
scanDuration (number): Time taken to scan (ms)

Use Cases

Digital Hoarder Cleanup

Find duplicate photos/videos
Identify wasted storage space
Remove old duplicates, keep newest
Clean up download folders

Document Management

Find duplicate PDFs, docs, reports
Keep latest version, archive old versions
Prevent version confusion
Reduce backup bloat

Project Cleanup

Find duplicate source files
Remove duplicate build artifacts
Clean up node_modules duplicates
Save storage on SSD/HDD

Backup Optimization

Find duplicate backup files
Remove redundant backups
Identify what's actually duplicated
Save space on backup drives

Configuration

Edit config.json to customize:

{
  "detection": {
    "defaultMethod": "content",
    "sizeTolerancePercent": 0,
    "nameSimilarity": 0.7,
    "includeSubdirs": true
  },
  "removal": {
    "defaultAction": "delete",
    "defaultKeep": "newest",
    "archivePath": "./archive",
    "sizeThreshold": 10485760,
    "autoConfirm": false
  },
  "exclude": {
    "patterns": [".git", "node_modules", ".vscode", ".idea"],
    "whitelist": ["important", "work", "projects"]
  }
}

Performance

Scanning Speed

Small directories (<1000 files): <1s
Medium directories (1000-10000 files): 1-5s
Large directories (10000+ files): 5-20s

Detection Accuracy

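Content-based: effectively 100% for exact duplicates (an accidental MD5 collision is extremely unlikely)
Size-based: Fast, but equal size does not prove equal content, so false positives are possible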
Name-based: Detects naming patterns only

Memory Usage

Hash cache: ~1MB per 100,000 files
Batch processing: Processes 1000 files at a time
Peak memory: ~100MB for 1M files

Safety Features

Size Thresholding

Won't remove files larger than configurable threshold (default: 10MB). Prevents accidental deletion of important large files.

Archive Mode

Move files to archive directory instead of deleting. No data loss, full recoverability.

Action Logging

All deletions/moves are logged to file for recovery and audit.

Tips

Best Results

Use content-based detection for documents (100% accurate)
Use size-based for media files (faster)
Run dry-run first to preview changes
Archive instead of delete for important files

Performance Optimization

Process frequently used directories first
Use maxSize to skip large media files during scans
Exclude hidden directories from scan
Process directories in parallel when possible

Troubleshooting

Not Finding Expected Duplicates

Check detection method (content vs size vs name)
Verify exclude patterns aren't too broad
Check if files are in whitelisted directories

Removal Not Working

Check write permissions on directories
Check autoConfirm and dryRun: with autoConfirm: false each deletion waits for confirmation, and with dryRun: true nothing is removed
Check size threshold isn't blocking all deletions
Check file locks (is another program using files?)

Slow Scanning

Narrow the scan: set includeSubdirs: false or pass fewer directories
Use size-based detection (faster)
Exclude large directories (node_modules, .git)
Process directories individually instead of batch

License

MIT

Find duplicates. Save space. Keep your system clean. 🔮

Permissions & Security

Security level L1: Low-risk skills with minimal permissions. Review inputs and outputs before running in production.

Requirements

OpenClaw CLI installed and configured.
Language: Markdown
License: MIT
Topics:

FAQ

How do I install file-deduplicator?

Run openclaw add @michael-laffin/file-deduplicator in your terminal. This installs file-deduplicator into your OpenClaw Skills catalog.

Does this skill run locally or in the cloud?

OpenClaw Skills execute locally by default. Review the SKILL.md and permissions before running any skill.

Where can I verify the source code?

The source repository is available at https://github.com/openclaw/skills/tree/main/skills/michael-laffin/file-deduplicator. Review commits and README documentation before installing.
