Identifying Duplicate Files Across All SharePoint Sites Using PowerShell

Managing a SharePoint environment can be a complex task, especially when it comes to ensuring that your storage is being used efficiently. Duplicate files across various sites and document libraries can quickly consume valuable space, slow down search results, and create confusion among users. Fortunately, with the power of PowerShell and the PnP (Patterns and Practices) PowerShell module, you can automate the process of identifying and removing these duplicates. In this blog, we’ll walk you through a PowerShell script that scans all sites in a SharePoint tenant, identifies duplicate files, and generates a comprehensive report.

Why Remove Duplicate Files?

Duplicate files can arise for various reasons—users might upload the same file to different libraries or sites, or versioning might result in multiple copies of similar documents. Whatever the cause, these duplicates can:

  • Waste Storage Space: SharePoint storage can be costly, especially if you’re on a limited storage plan.
  • Decrease Performance: More files mean longer indexing times, slower searches, and overall reduced performance.
  • Cause User Confusion: Multiple copies of the same document can lead to version conflicts and make it harder for users to find the correct file.

By regularly identifying and removing duplicates, you can keep your SharePoint environment lean, efficient, and user-friendly.

The PowerShell Script

The PowerShell script below requires PowerShell 7 and the PnP.PowerShell module. It connects to every site in your SharePoint Online tenant, searches all document libraries for duplicate files, and exports the results to a CSV file for further action.
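If you don't already have the PnP.PowerShell module installed, you can grab it from the PowerShell Gallery as shown below. Note that recent versions of the module may also require you to register your own Entra ID application and pass its client ID to Connect-PnPOnline with the -ClientId parameter; check the PnP PowerShell documentation for the version you're running.

#Install the PnP.PowerShell module from the PowerShell Gallery (requires PowerShell 7)
Install-Module -Name PnP.PowerShell -Scope CurrentUser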

#Parameters
$TenantAdminSiteURL = "https://yourtenantname-admin.sharepoint.com"
$PageSize = 2000
$ReportOutput = "C:\Temp\DuplicateFilesReport.csv"
  
#Connect to SharePoint Online tenant admin site
Connect-PnPOnline $TenantAdminSiteURL -Interactive

#Get all site collections
$Sites = Get-PnPTenantSite

#Array to store results
$DataCollection = @()

#Iterate through each site collection
ForEach($Site in $Sites) {
    Write-Host "Processing Site: $($Site.Url)"
    
    #Connect to the site
    Connect-PnPOnline $Site.Url -Interactive
    
    #Get all Document libraries
    $DocumentLibraries = Get-PnPList | Where-Object {$_.BaseType -eq "DocumentLibrary" -and $_.Hidden -eq $false -and $_.ItemCount -gt 0 -and $_.Title -Notin("Site Pages","Style Library", "Preservation Hold Library")}
    
    #Iterate through each document library
    ForEach($Library in $DocumentLibraries)
    {
        #Get All documents from the library
        $global:counter = 0;
        $Documents = Get-PnPListItem -List $Library -PageSize $PageSize -Fields ID, File_x0020_Type -ScriptBlock `
            { Param($items) $global:counter += $items.Count; Write-Progress -PercentComplete ($global:Counter / ($Library.ItemCount) * 100) -Activity `
                 "Getting Documents from Library '$($Library.Title)'" -Status "Getting Documents data $global:Counter of $($Library.ItemCount)";} | Where {$_.FileSystemObjectType -eq "File"}
        
        $ItemCounter = 0
        #Iterate through each document
        Foreach($Document in $Documents)
        {
            #Get the File from Item
            $File = Get-PnPProperty -ClientObject $Document -Property File
    
            #Get The File Hash
            $Bytes = $File.OpenBinaryStream()
            Invoke-PnPQuery
            $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
            $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))
    
            #Collect data       
            $Data = New-Object PSObject
            $Data | Add-Member -MemberType NoteProperty -Name "SiteURL" -Value $Site.Url
            $Data | Add-Member -MemberType NoteProperty -Name "FileName" -Value $File.Name
            $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -Value $HashCode
            $Data | Add-Member -MemberType NoteProperty -Name "URL" -Value $File.ServerRelativeUrl
            $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -Value $File.Length       
            $DataCollection += $Data
            $ItemCounter++
            Write-Progress -PercentComplete ($ItemCounter / ($Library.ItemCount) * 100) -Activity "Collecting data from Documents $ItemCounter of $($Library.ItemCount) from $($Library.Title)" `
                         -Status "Reading Data from Document '$($Document['FileLeafRef'])' at '$($Document['FileRef'])'"
        }
    }
}

#Get Duplicate Files by Grouping Hash code
$Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1} | Select -ExpandProperty Group
Write-host "Duplicate Files Based on File Hashcode:"
$Duplicates | Format-table -AutoSize

#Export the duplicates results to CSV
$Duplicates | Export-Csv -Path $ReportOutput -NoTypeInformation
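Once you've reviewed the CSV and decided which copies should go, you can act on the report with Remove-PnPFile. The sketch below is only an illustration: it assumes you add a "Delete" column to the exported report and set it to "Yes" for the copies you want to recycle; that column is not produced by the script above.

#Read the reviewed report and recycle the copies marked for deletion
#Assumes a manually added "Delete" column set to "Yes" for unwanted copies
$ReviewedReport = Import-Csv -Path "C:\Temp\DuplicateFilesReport.csv"
ForEach($Row in ($ReviewedReport | Where-Object {$_.Delete -eq "Yes"}))
{
    #Connect to the site that hosts the file
    Connect-PnPOnline $Row.SiteURL -Interactive

    #Send the file to the recycle bin rather than deleting it permanently
    Remove-PnPFile -ServerRelativeUrl $Row.URL -Recycle -Force
}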
