Match the overlap between the csdn user database and the official user and filter out users that overlap

  • 2020-05-10 17:52:33
  • OfStack

Process:
1. Get the user database of csdn imported locally
Try editplus, open the prompt out of memory, did not find a way, colleagues under linux to view 1, the basic format is as follows:
Username # password # mailbox
Username # password # mailbox
Corresponding data structure:
 
CREATE TABLE IF NOT EXISTS `csdn_userdb` ( 
  `id` int(10) NOT NULL auto_increment, 
  `username` varchar(50) character set gbk NOT NULL, 
   `password` varchar(50) character set gbk NOT NULL, 
  `email` varchar(50) character set gbk NOT NULL, 
   PRIMARY KEY (`id`), 
  KEY `username` (`username`), 
  KEY `email` (`email`) 
  ) ENGINE=MyISAM DEFAULT CHARSET=gbk AUTO_INCREMENT=1 ; 

1. It is suspected that fopen open file is written to cache, but it has been proved to be very fast in practice, so it should not be written to cache. The following is the code to import data
 
<?php 
$link = mysql_connect('localhost', 'root', 'admin', true); 
mysql_select_db('csdn',$link); 
$handle = fopen("C:\Users\zhudong\Desktop\www.csdn.net.sql", "r"); 
while (!feof($handle)){ 
$i++; 
$buffer = fgets($handle); 
list($u,$p,$e) = explode(" # ",$buffer); 
mysql_query("INSERT INTO csdn_userdb(username,password,email) VALUES ('$u','$p','$e')",$link); 
if ($i%1000 == 0) echo $i."\n"; 
} 
fclose($handle); 
?> 

The above code is very inefficient, so the modified code is as follows:
 
<?php 
$link = mysql_connect('localhost', 'root', 'admin', true); 
mysql_select_db('csdn',$link); 
$handle = fopen("C:\Users\zhudong\Desktop\www.csdn.net.sql", "r"); 
$perpage = 50; 
while (!feof($handle)){ 
$i++; 
$buffer = fgets($handle); 
list($u,$p,$e) = explode(" # ",$buffer); 
$insertValue[] = "('$u','$p','$e')"; 
if ($i% $perpage == 0){ $perpage == 0){ 
$instrtValueString = implode(',',$insertValue); 
mysql_query("INSERT INTO csdn_userdb(username,password,email) VALUES $instrtValueString",$link); 
echo $i."\n"; 
$insertValue = array(); 
} 
} 
fclose($handle); 

In order to understand what factors affect the efficiency of the imported data, tests were conducted according to different Settings
The total number of CSDN users is 6428600
When $perpage = 500; Data after import: 5,902,000; Data loss rate: 8%; Data table engine: MyISAM; Index: yes; Total time: 15 minutes
When $perpage=200, the total amount of data imported: 6,210,200; Data loss: 218400; Loss rate: 3.3%; Data table engine: MYISAM; Index: yes; Total time: 30 minutes
When $perpage=200, the total amount of data imported: 6,210,200; Data loss: 218400; Loss rate: 3.3%; Data table engine: INNODB; Index: yes; Total time: 65 minutes
When $perpage=200, the total amount of data imported: 6,210,200; Data loss: 218400; Loss rate: 3.3%; Data table engine: MYISAM; Index: none; Total time: 14 minutes (re-index separately after data import)
When $perpage=50, the total data after import: 6,371,200; Data loss: 57400, loss rate: 0.8%; Data table engine: MYISAM; Index: none: total time: 20 minutes
According to the above situation, it is summarized as follows:
1, the efficiency of the index after the pilot input data is 1 times higher than that after the index is added
2. InnoDB is much less efficient than MYISAM at single-process data insertion
3. When perpage=50, the data loss rate is below 1%
 
 Because execution through a browser has a timeout problem and is inefficient, it is run from the command line, which is encountered during this process 1 A little trouble took some time  
 At first I executed the following code:  
php.exe E:\usr\www\importcsdndb.php 
 but 1 Keep an error: call to undefined function mysql_connect 
 Toss found no load php.ini 
 The correct code is:  
php.exe -c E:/usr/local/apache2/php.ini importcsdndb.php 

2. Import the matching user data to the local
Command line enter msyql (not baidu itself)
Then execute: mysql > source C:/Users/zhudong/Desktop/userdb.sql
3. Compare and screen users
Once the program is written, remember to run it from the command line:
 
<?php 
$link = mysql_connect('localhost', 'root', 'admin', true); 
mysql_select_db('csdn',$link); 
$handle_username = fopen("E:/records_username.txt","a"); 
//$handle_email = fopen("E:/records_email.txt","a"); 
$username_num = $email_num = $uid = 0; 
while ($uid<2181106) { 
$nextuid=$uid+10000; 
$query = mysql_query("SELECT * FROM pw_members WHERE uid>'$uid' AND uid<'$nextuid'"); 
while ($rt = mysql_fetch_array($query,MYSQL_ASSOC)) { 
$username = $rt['username']; 
$email = $rt['email']; 
$query2 = mysql_query("SELECT * FROM scdn_userdb WHERE username='$username' OR email='$email'"); 
while ($rt2 = mysql_fetch_array($query2,MYSQL_ASSOC)) { 
if ($rt['password'] = md5($rt2['password'])) { 
if ($rt2['username'] == $username) { 
$username_num++; 
fwrite($handle_username,'OWN:'.$rt['uid'].'|'.$rt['username'].'|'.$rt['password'].'|'.$rt['email'].' CSDN:'.$rt2['username'].'|'.$rt2['password'].'|'.$rt2['email']."\r\n"); 
echo 'username_num='.$username_num."\r\n"; 
continue; 
} 
/* 
if ($rt2['email'] == $email) { 
$email_num++; 
fwrite($handle_email,'OWN:'.$rt['uid'].'|'.$rt['username'].'|'.$rt['password'].'|'.$rt['email'].' CSDN:'.$rt2['username'].'|'.$rt2['password'].'|'.$rt2['email']."\r\n"); 
echo 'email_num='.$email_num."\r\n"; 
} 
*/ 
} 
} 
mysql_free_result($query2); 
} 
$uid = $nextuid; 
} 
?> 

You see the above code is very poor, because its efficiency is particularly low, millions of data, to run for more than 10 hours, how can you forget to connect the table query such a basic thing, the following is the revised method
 
$link = mysql_connect('localhost', 'root', 'admin', true); 
mysql_select_db('csdn',$link); 
$handle_username = fopen("E:/records_username.txt","a"); 
while($uid<2181106) {// The number here is the maximum user library to compare ID 
$nextuid= $uid+10000; 
$query = mysql_query("SELECT m.uid,m.username,m.password,m.email,u.password as csdn_password,u.email as csdn_email FROM own_members m LEFT JOIN csdn_userdb u USING(username) WHERE m.uid>'$uid' AND m.uid<='$nextuid' AND u.username!=''"); 
while ($rt = mysql_fetch_array($query,MYSQL_ASSOC)) { 
if ($rt['password'] == md5($rt['csdn_password'])) { 
$username_num++; 
fwrite($handle_username,'OWN:'.$rt['uid'].'|'.$rt['username'].'|'.$rt['password'].'|'.$rt['email'].' CSDN:'.$rt['username'].'|'.$rt['csdn_password'].'|'.$rt['csdn_email']."\r\n"); 
echo 'username_num='.$username_num."\r\n"; 
} 
} 
$uid = $nextuid; 
echo 'uid='.$uid; 
} 
?> 

The total comparison time is 25 minutes, which is a big improvement over the previous 10-plus hours of execution
Total user: 34175
Percentage of total members: 1.7%
1.7% of the heavy user is quite serious, I hope this article to you webmaster comparison site users help

Related articles: