PHP code for exporting a web page as a Word document

  • 2020-05-16 06:37:38
  • OfStack

There are two ways to export a doc document. The first is to use COM: install it on the server as a PHP extension, then create a COM object and call its methods. A server with Office installed can instantiate the COM object named word.application and use it to generate a Word document. I do not recommend this, because it executes inefficiently: when I tested it, the server actually opened a Word client while the code was running. Ideally, a COM component would have no interface and would do the data conversion in the background, but components of that kind are generally paid products.

The second method is to use PHP to write the contents of the document directly to a file with the doc suffix. This method does not rely on third-party extensions and runs efficiently.

Word itself is quite capable. It can open a file in HTML format and preserve the formatting, and it recognizes and opens such a file normally even when the suffix is doc. This is convenient for us, but there is one problem: an HTML file holds only the address of each image, while the image data is stored elsewhere. In other words, if plain HTML is written to the doc file, the doc file cannot contain the images. So how do we create a doc document with images? We can use MHT, a format very close to HTML.
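As a minimal sketch of the HTML-only variant, the snippet below writes a small HTML string to a file with the doc suffix. The markup and the output path are made up for illustration. Word should open the result as a formatted document, but an image referenced by URL would not be stored in the file.

```php
<?php
// Hypothetical sample markup; any HTML string would do.
$html = '<html><head><meta charset="utf-8"><title>Report</title></head>'
      . '<body><h1>Monthly report</h1><p>Plain <b>HTML</b> content.</p></body></html>';

// Hypothetical output path inside the system temp directory.
$docPath = sys_get_temp_dir() . '/html_only_example.doc';

// The file content is plain HTML; only the suffix says "doc".
$bytesWritten = file_put_contents($docPath, $html);
```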

The MHT format is similar to HTML, except that externally linked files such as images, JavaScript, and CSS are base64-encoded and stored inside the file. As a result, a single MHT file can hold all the resources of a web page, although it is larger than the HTML alone.
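To make the layout concrete, here is a rough sketch that assembles a tiny MHT document by hand: one HTML part and one image part inside a multipart/related container. The function name, the boundary string, and the placeholder image bytes are assumptions for illustration only.

```php
<?php
/**
 * Build a tiny MHT document with one HTML part and one image part.
 * All names here are illustrative, not part of any library.
 */
function buildMinimalMht($html, $imageBytes, $imageLocation, $boundary = 'EXAMPLE_BOUNDARY')
{
    $eol = "\r\n";

    // Top-level headers announce a multipart/related container.
    $out  = 'MIME-Version: 1.0' . $eol;
    $out .= 'Content-Type: multipart/related; boundary="' . $boundary . '"; type="text/html"' . $eol;
    $out .= $eol;

    // Part 1: the page itself, base64-encoded in 76-character lines.
    $out .= '--' . $boundary . $eol;
    $out .= 'Content-Type: text/html' . $eol;
    $out .= 'Content-Transfer-Encoding: base64' . $eol;
    $out .= 'Content-Location: page.html' . $eol . $eol;
    $out .= chunk_split(base64_encode($html), 76, $eol);

    // Part 2: the image. Content-Location repeats the src used in the HTML.
    $out .= '--' . $boundary . $eol;
    $out .= 'Content-Type: image/png' . $eol;
    $out .= 'Content-Transfer-Encoding: base64' . $eol;
    $out .= 'Content-Location: ' . $imageLocation . $eol . $eol;
    $out .= chunk_split(base64_encode($imageBytes), 76, $eol);

    // The closing delimiter ends the container.
    $out .= '--' . $boundary . '--' . $eol;
    return $out;
}

// Placeholder bytes stand in for real PNG data.
$mht = buildMinimalMht('<html><body><img src="logo.png" /></body></html>', 'fake image bytes', 'logo.png');
```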

Can Word recognize MHT? I saved a web page as MHT, changed the suffix to doc, and opened it with Word. It worked: Word recognized the MHT content and displayed the pictures.

Now that Word can recognize MHT, the next question is how to put the images into the MHT file. The address of each image is written in the src attribute of an img tag, so the image addresses can be obtained by extracting the src attribute values from the HTML code. You may get a relative path, but that is fine: prepend the URL prefix to turn it into an absolute path. With the image address, we can fetch the content of the image file with the file_get_contents function, encode it with the base64_encode function, and insert the result at the appropriate location in the MHT file.
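The three steps can be sketched separately. The helper names below (extractImageSources, toAbsoluteUrl, encodeFileAsBase64) are hypothetical, and the prefixing is deliberately simple: it does not resolve "../" segments or root-relative paths. The demo encodes a local temporary file so that it runs without network access.

```php
<?php
/** Return the src values of all img tags whose src is quoted. */
function extractImageSources($html)
{
    preg_match_all('/<img\b[^>]*?\ssrc\s*=\s*["\']([^"\']*)["\']/i', $html, $matches);
    return array_map('trim', $matches[1]);
}

/** Prefix a relative address with the base URL; leave absolute ones alone. */
function toAbsoluteUrl($src, $baseUrl)
{
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    return rtrim($baseUrl, '/') . '/' . ltrim($src, '/');
}

/** Read a file or URL and return its base64 text, or null on failure. */
function encodeFileAsBase64($location)
{
    $bytes = @file_get_contents($location);
    return $bytes === false ? null : chunk_split(base64_encode($bytes), 76, "\r\n");
}

// Hypothetical input: one relative and one absolute image address.
$html = '<p><img src="images/logo.png" alt="" /><img src=\'http://example.com/a.gif\'></p>';
$sources = extractImageSources($html);
$absolute = array();
foreach ($sources as $src) {
    $absolute[] = toAbsoluteUrl($src, 'http://example.com/site/');
}

// Encode a local stand-in file instead of downloading a real image.
$tmpFile = tempnam(sys_get_temp_dir(), 'img');
file_put_contents($tmpFile, 'abc');
$encoded = encodeFileAsBase64($tmpFile);
unlink($tmpFile);
```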

Finally, there are two ways to send the file to the client. One is to generate the doc document on the server, record its address, and let the client download it from there. The other is to send it directly in the HTTP response: set the Content-Type header to application/msword and the Content-Disposition header to attachment followed by the file name, then send the file content right after the headers so that the client downloads it as a doc document.
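A minimal sketch of the direct-download variant follows. The function names and the default file name are assumptions. It uses application/msword, the MIME type registered for .doc files, and replaces unusual characters in the file name so that the quoted header value stays valid.

```php
<?php
/**
 * Build the HTTP response headers for a Word download.
 * The function name is illustrative.
 */
function buildWordDownloadHeaders($fileName, $contentLength)
{
    // Keep only characters that are safe inside a quoted header parameter.
    $safeName = preg_replace('/[^A-Za-z0-9._-]/', '_', $fileName);
    return array(
        'Content-Type: application/msword',
        'Content-Disposition: attachment; filename="' . $safeName . '"',
        'Content-Length: ' . $contentLength,
    );
}

/** Send the headers, then the document content, to the client. */
function sendWordDownload($fileContent, $fileName = 'export.doc')
{
    foreach (buildWordDownloadHeaders($fileName, strlen($fileContent)) as $line) {
        header($line);
    }
    echo $fileContent;
}

// Inspect the headers without sending anything.
$headers = buildWordDownloadHeaders('my report.doc', 1234);
```

In a real page, sendWordDownload() has to run before any other output, because header() cannot change the headers once the response body has started.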

Implementation

With the principles above, you should have a preliminary understanding of the implementation process. Next is an export function that converts HTML code into an MHT document. It takes three parameters, of which the last two are optional:
content: the HTML code to convert
absolutePath: if the image addresses in the HTML code are all relative paths, this parameter is the absolute URL prefix that they are missing
isEraseLink: whether to remove hyperlinks from the HTML code
The return value is the content of the MHT file, which you can save as a file with the doc suffix via file_put_contents.
The main job of this function is to find all the image addresses in the HTML code and download them in sequence. Once it has the content of an image, it calls the MhtFileMaker class to add the image to the MHT file. The details are encapsulated in the MhtFileMaker class.
 
/**
 * Build Word document content from HTML code.
 * The function analyzes the HTML, downloads the image resources that the page
 * references, and packs everything into MHT content.
 * This function depends on the MhtFileMaker class.
 * It analyzes img tags and extracts their src attribute values. The src value
 * must be enclosed in quotation marks, otherwise it cannot be extracted.
 *
 * @param string $content      HTML content
 * @param string $absolutePath Absolute URL of the web page. If the image paths in the HTML are relative, this prefix is prepended to them. It must end with "/".
 * @param bool   $isEraseLink  Whether to remove the links in the HTML content
 * @return string MHT content
 */
function getWordDocument($content, $absolutePath = "", $isEraseLink = true)
{
    $mht = new MhtFileMaker();
    if ($isEraseLink) {
        // Remove the links but keep their text
        $content = preg_replace('/<a\b[^>]*>(.*?)<\/a>/is', '$1', $content);
    }
    $images = array(); // addresses to download
    $files = array();  // original src values, used as Content-Location
    $matches = array();
    // The src attribute value must be enclosed in quotation marks
    if (preg_match_all('/<img\b[^>]*?\ssrc\s*=\s*["\']([^"\']*)["\'][^>]*>/i', $content, $matches)) {
        foreach ($matches[1] as $path) {
            $imgPath = trim($path);
            if ($imgPath != "") {
                $files[] = $imgPath;
                // Absolute links are used as-is; relative ones get the prefix
                if (!preg_match('#^https?://#i', $imgPath)) {
                    $imgPath = $absolutePath . $imgPath;
                }
                $images[] = $imgPath;
            }
        }
    }
    $mht->AddContents("tmp.html", $mht->GetMimeType("tmp.html"), $content);
    for ($i = 0; $i < count($images); $i++) {
        $image = $images[$i];
        // Download once and test the result, instead of opening the URL twice
        $imgcontent = @file_get_contents($image);
        if ($imgcontent !== false) {
            $mht->AddContents($files[$i], $mht->GetMimeType($image), $imgcontent);
        } else {
            echo "file: " . $image . " does not exist!<br />";
        }
    }
    return $mht->GetFile();
}

Usage:
 
$fileContent = getWordDocument($content,"http://www.yoursite.com/Music/etc/"); 
$fp = fopen("test.doc", 'w'); 
fwrite($fp, $fileContent); 
fclose($fp); 

The $content variable should hold the HTML source code, and the URL in the second argument is the prefix used to complete the relative image paths in that HTML.
Note that before using this function you need to include the MhtFileMaker class, which generates the MHT document.
 
<?php
/***********************************************************************
Class: Mht File Maker
Version: 1.2 beta
Date: 02/11/2007
Author: Wudi <wudicgi@yahoo.de>
Description: The class can make .mht file.
***********************************************************************/
class MhtFileMaker{
    public $config = array();
    public $headers = array();
    public $headers_exists = array();
    public $files = array();
    public $boundary;
    public $dir_base;
    public $page_first;

    // Empty placeholder kept from the original class. It is not a
    // constructor, because its name does not match the class name.
    function MhtFile($config = array()){
    }

    function SetHeader($header){
        $this->headers[] = $header;
        $key = strtolower(substr($header, 0, strpos($header, ':')));
        $this->headers_exists[$key] = TRUE;
    }

    function SetFrom($from){
        $this->SetHeader("From: $from");
    }

    function SetSubject($subject){
        $this->SetHeader("Subject: $subject");
    }

    function SetDate($date = NULL, $istimestamp = FALSE){
        if ($date == NULL) {
            $date = time();
        }
        if ($istimestamp == TRUE) {
            $date = date('D, d M Y H:i:s O', $date);
        }
        $this->SetHeader("Date: $date");
    }

    function SetBoundary($boundary = NULL){
        if ($boundary == NULL) {
            $this->boundary = '--' . strtoupper(md5(mt_rand())) . '_MULTIPART_MIXED';
        } else {
            $this->boundary = $boundary;
        }
    }

    function SetBaseDir($dir){
        $this->dir_base = str_replace("\\", "/", realpath($dir));
    }

    function SetFirstPage($filename){
        $this->page_first = str_replace("\\", "/", realpath("{$this->dir_base}/$filename"));
    }

    // Add the first page, then every other file below the base directory
    function AutoAddFiles(){
        if (!isset($this->page_first)) {
            exit ('Not set the first page.');
        }
        $filepath = str_replace($this->dir_base, '', $this->page_first);
        $filepath = 'http://mhtfile' . $filepath;
        $this->AddFile($this->page_first, $filepath, NULL);
        $this->AddDir($this->dir_base);
    }

    function AddDir($dir){
        $handle_dir = opendir($dir);
        // Compare with false so that an entry named "0" does not end the loop
        while (false !== ($filename = readdir($handle_dir))) {
            if (($filename!='.') && ($filename!='..') && ("$dir/$filename"!=$this->page_first)) {
                if (is_dir("$dir/$filename")) {
                    $this->AddDir("$dir/$filename");
                } elseif (is_file("$dir/$filename")) {
                    $filepath = str_replace($this->dir_base, '', "$dir/$filename");
                    $filepath = 'http://mhtfile' . $filepath;
                    $this->AddFile("$dir/$filename", $filepath, NULL);
                }
            }
        }
        closedir($handle_dir);
    }

    function AddFile($filename, $filepath = NULL, $encoding = NULL){
        if ($filepath == NULL) {
            $filepath = $filename;
        }
        $mimetype = $this->GetMimeType($filename);
        $filecont = file_get_contents($filename);
        $this->AddContents($filepath, $mimetype, $filecont, $encoding);
    }

    // Store one part; content is base64-encoded unless an encoding is given
    function AddContents($filepath, $mimetype, $filecont, $encoding = NULL){
        if ($encoding == NULL) {
            $filecont = chunk_split(base64_encode($filecont), 76);
            $encoding = 'base64';
        }
        $this->files[] = array('filepath' => $filepath,
                               'mimetype' => $mimetype,
                               'filecont' => $filecont,
                               'encoding' => $encoding);
    }

    function CheckHeaders(){
        if (!array_key_exists('date', $this->headers_exists)) {
            $this->SetDate(NULL, TRUE);
        }
        if ($this->boundary == NULL) {
            $this->SetBoundary();
        }
    }

    function CheckFiles(){
        return count($this->files) != 0;
    }

    // Assemble the headers and all parts into the final MHT content
    function GetFile(){
        $this->CheckHeaders();
        if (!$this->CheckFiles()) {
            exit ('No file was added.');
        }
        $contents = implode("\r\n", $this->headers);
        $contents .= "\r\n";
        $contents .= "MIME-Version: 1.0\r\n";
        $contents .= "Content-Type: multipart/related;\r\n";
        $contents .= "\tboundary=\"{$this->boundary}\";\r\n";
        $contents .= "\ttype=\"" . $this->files[0]['mimetype'] . "\"\r\n";
        $contents .= "X-MimeOLE: Produced By Mht File Maker v1.0 beta\r\n";
        $contents .= "\r\n";
        $contents .= "This is a multi-part message in MIME format.\r\n";
        $contents .= "\r\n";
        foreach ($this->files as $file) {
            $contents .= "--{$this->boundary}\r\n";
            $contents .= "Content-Type: {$file['mimetype']}\r\n";
            $contents .= "Content-Transfer-Encoding: {$file['encoding']}\r\n";
            $contents .= "Content-Location: {$file['filepath']}\r\n";
            $contents .= "\r\n";
            $contents .= $file['filecont'];
            $contents .= "\r\n";
        }
        $contents .= "--{$this->boundary}--\r\n";
        return $contents;
    }

    function MakeFile($filename){
        $contents = $this->GetFile();
        $fp = fopen($filename, 'w');
        fwrite($fp, $contents);
        fclose($fp);
    }

    function GetMimeType($filename){
        $pathinfo = pathinfo($filename);
        // A missing extension falls through to the default type
        $extension = isset($pathinfo['extension']) ? strtolower($pathinfo['extension']) : '';
        switch ($extension) {
            case 'htm': $mimetype = 'text/html'; break;
            case 'html': $mimetype = 'text/html'; break;
            case 'txt': $mimetype = 'text/plain'; break;
            case 'cgi': $mimetype = 'text/plain'; break;
            case 'php': $mimetype = 'text/plain'; break;
            case 'css': $mimetype = 'text/css'; break;
            case 'jpg': $mimetype = 'image/jpeg'; break;
            case 'jpeg': $mimetype = 'image/jpeg'; break;
            case 'jpe': $mimetype = 'image/jpeg'; break;
            case 'gif': $mimetype = 'image/gif'; break;
            case 'png': $mimetype = 'image/png'; break;
            default: $mimetype = 'application/octet-stream'; break;
        }
        return $mimetype;
    }
}
?>
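One way to sanity-check generated MHT content is to list the Content-Location of each part and compare the list with the src values in the HTML. The helper below is a hypothetical sketch and not part of MhtFileMaker. The sample string is hand-written to mimic the part headers that the class emits.

```php
<?php
/** Return the Content-Location value of every part in an MHT string. */
function listMhtLocations($mht)
{
    preg_match_all('/^Content-Location:\s*(.+?)\r?$/mi', $mht, $matches);
    return $matches[1];
}

// Hand-written sample with two parts; the payloads are placeholders.
$sample = "MIME-Version: 1.0\r\n"
        . "Content-Type: multipart/related;\r\n\tboundary=\"B\";\r\n\ttype=\"text/html\"\r\n\r\n"
        . "--B\r\nContent-Type: text/html\r\nContent-Transfer-Encoding: base64\r\n"
        . "Content-Location: tmp.html\r\n\r\nPGh0bWw+\r\n"
        . "--B\r\nContent-Type: image/png\r\nContent-Transfer-Encoding: base64\r\n"
        . "Content-Location: images/logo.png\r\n\r\nYWJj\r\n"
        . "--B--\r\n";

$locations = listMhtLocations($sample);
```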
