We have created a document library system to house digital copies of important documents for our client groups. Prior to the creation of the document library the client groups would store documents in nested file system folders. Over time this repository grew to a few hundred thousand documents. With a hunderds of thousands of document and thousands of folder it's easy to see that this was out of hand. There were many reports of folders accidentally dropped in other folders and so on. Searching for individual documents was time and resource consuming. Right off the bat we were able to add documents to a newly created document library in SharePoint 2007 with no issues. But what about the existing documents on the file shares? How would they get into SharePoint. A person (or team of people) would require considerable resources and time to manually input a few hundred thousand documents into SahrePoint. Enter Powershell.
For this solution I am importing pdf, doc, and even some gif and jpg files from a file share into a SharePoint 2007 document library. There are about 12 custom columns in the library also. I am using Powershell 1.0.
Overview
In the following script I will be grabbing properties of a clooection of documents, massaging some of the detailed data, storing them in local variables, and finally gaining access to the sharepoint library and inserting them to the sharepoint library.
Let's look at the script them rip it apart step by step.
$gci = gci "\\server\share" -recurse -force | ? {$_.lastwritetime -ge "01.01.1900"-and $_.attributes -ne "directory" -and $_.extension -ne ".url"}
foreach
($file in $gci)
{
#begin
$acct = $file.fullname.replace("-","").substring(($file.directoryname.lastindexofany("\") ),9)
$date = $file.creationtime
$repl_filename = $file.name
$repl2 = $repl_filename.replace("&","and")
# there were more here
$repl12 = $repl2.replace("..",".")
$repl_filefullname = $file.directoryname + "\" + $repl12
$propbag=@{”ContentType”=”Document”
“Name” = $repl11
"Property1" = $acct.replace("\","")
"Property2" = "Reviewed"
"Property3" = $date
"Property4" = "Legacy Import"}
$docliburl=”http://full url to doc library”
$relweburl=”/folder containing doc library”
[System.Reflection.Assembly]::LoadWithPartialName(”Microsoft.SharePoint”) > $null
$site=new-object Microsoft.SharePoint.SPSite($docliburl)
$web=$site.openweb($relweburl)
$folder=$web.getfolder($docliburl)
$list=$web.lists[$folder.ContainingDocumentLibrary]
#process
$stream=[IO.File]::OpenRead($file.fullname)
$folder.files.Add($repl11,$stream,$propbag, $true) > $null
$stream.close()
}
WOW! You might be thinking that is ugly! Remember where beauty lays?
A word about variables in Powershell.
Powershell understands Objects. Powershell works with Objects. a variable is an object. maybe it is an object of type string or int32, maybe it is a custom object or an array. All variables in Powershell are prefixed with a $ symbol.
A word about shorthand in Powershell.
Powershell has lots of shorthand. The more I use Powershell, the more I learn about available shorthand. I will try to explain as many as I can here.
Let's get started!
$gci = gci "\\server\share" -recurse -force | ? {$_.lastwritetime -ge "01.01.1900"-and $_.attributes -ne "directory" -and $_.extension -ne ".url"}
$gci Here I am declaring a variable named $gci it is of type what comes next.
gci "\\server\share" -recurse -force | ? {$_.lastwritetime -ge "01.01.1900"-and $_.attributes -ne "directory" -and $_.extension -ne ".url"}gci is shorthand for get-childitem. Here I am using get-childitem to retrieve files from a shared directory. In this example it is "\\server\share"
-recurse Traditionally it means rerun, here this allows me to search all directories and sub directories.
-force allows me not to get tangled up in open files etc.
| (pipe) in Powershell the | character is a seperator. It allows me to seperate different sections of a command.
? this begins the filter or where clause. the keyword "where" could also be used.
{ begin block
$_ In Powershell, in dealing with variables there is a concept of a pipeline object. This is the current object being accessed. instead of using the object or variable name, you can use the $_ convention.
$_.lastwritetime Here I am using the current object property "lastwritetime".
A word about properties in Sharepoint.
SharePoint uses and understands objects. As with traditional "OO" design objects are derived from 2 things properties and methods. A true object has "details" and does "stuff". Powershell objects follow this principal. How can you see what methods and properties an object has? I'm glad you asked.
get-member.
get member is an necessary command for discovering more about any PowerShell object. Because you are at the command line, you cannot right click an object and check its properties and methods. Instead, what you can do is type: Get-Object | Get-Member. In this case:
gci | get-memberThat's too much information for what we are working with right now. luckily PowerShell is a self describing language. So we can ask it a more specific question:
gci | get-member -Membertype propertythis will only return properties and my ability to read or write them.
gci | get-member -Membertype methodthis will retuen methods that are accessable to the object queried with get-member.
Remember, Powershell understands objects. Different properties and methods will be returned for whatever type of object you query with get-member.
back to $_.lastwritetime. Here I am using the lastwrite property of the current pipeline object which is of type get-childitem in my filter or where clause.
-ge is shorthand for greater than or equal to. more examples of usable shorthand include -eq means equal, -ne means not equal, etc...
-and is the command allowing me to add an additional condition to my filter oe where clause.
$_.attributes -ne "directory" -and $_.extension -ne ".url"} With the rest here I am filtering out root directories and files with a url extension (shortcuts).
foreach ($file in $gci) This uses a pretty standard construct in almost any programming language. The foreach loop. Here I am declaring a new variable ($file) and I am dropping that into my foreach loop.
$acct = $file.fullname.replace("-","").substring(($file.directoryname.lastindexofany("\") ),9) Here I am Creating a new variable and filling it with a string that has been massaged. I am replacing characters and performing string manipulation of the directoryname property of the $file object.
$date = $file.creationtime Here I am declaring a new variable and I am populating it with the creationdate property of the $file object.
$repl_filename = $file.name
$repl2 = $repl_filename.replace("&","and")
$repl12 = $repl2.replace("..",".") here I have some business rules and logic for what the finished filename will look like.
It turns out that SharePoint will not store documents named with certain characters even though the OS is ok with some of these characters. In my experience, these characters include but may not be limited to /#,+()$~%.
Reminds me of Qbert :)
$propbag=@{ Here I am declaring a variable of type array and next I declare it's elements. These elements are named exactly as the custom columns in the sharepoint list or library. This is necessary for them to properly populate the list or library when we add the document and it's properties in the next few lines of code.
”ContentType”=”Document”
“Name” = $repl11
"Property1" = $acct.replace("\","")
"Property2" = "Reviewed"
"Property3" = $date
"Property4" = "Legacy Import" Here I am setting custom property names and I am populating them with values. You can see that I am both passing ion variables for the values and "hardcoding" some of the values.
$docliburl=”http://full url to doc library”
$relweburl=”/folder containing doc library”
[System.Reflection.Assembly]::LoadWithPartialName(”Microsoft.SharePoint”) > $null
$site=new-object Microsoft.SharePoint.SPSite($docliburl)
$web=$site.openweb($relweburl)
$folder=$web.getfolder($docliburl)
$list=$web.lists[$folder.ContainingDocumentLibrary]I have alot going on here. This all is necessary for connectivity and access to the object in order to manipulate items in the SharePoint document library or list. Here I am declaring and populating variables for access to the correct list or library. I need access the the sharepoint methods and properties so in this case I need to load data from the sharepoint dll directly. Here I use System.Reflection to accomplish this.
$stream=[IO.File]::OpenRead($file.fullname)
$folder.files.Add($repl11,$stream,$propbag, $true) > $null
$stream.close()Now for the meat. Here I am declaring a stream and I am populating it with the actual filesystem file object which is the file. Next I am adding the stream and the array of properties (propbag) to the sharepoint list. Then I close my stream.
Using this script I was able to import 200k + documents from a disjoined malformed windows directory share into a SharePoint 2007 document library and fill out custom properties such as the date the file was created and when it was last accessed. I was also able to clean up filenames. The best item was that I was able to extrapolate a unique identifier out of the folder or file name to use as a search identifier in the document library.
That's alot! I hope that you can use this code as a spring board for your own powershell and/or SharePoint projects.
happy coding!
Labels: SharePoint insert Powershell Document Library List