Folks,
As we all know, spam has been an ongoing problem over the years. We've dealt with it reasonably well, since we value the openness of our wiki, and that's good.
However, all this churn has left a lot of garbage on the system in the form of orphaned pages and dummy users, and they have accumulated to the point that we are reaching system limits.
I need your help to clean them out. First, we have to separate the spam from the real content. I've created a Google Spreadsheet for pages:
https://docs.google.com/a/lattica.com/spreadsheet/ccc?key=0AmY-Kp_Ihu3idFNEO...
and one for users:
https://docs.google.com/a/lattica.com/spreadsheet/ccc?key=0AmY-Kp_Ihu3idGdPY...
We need to mark spam as "Spam" and what's good as "OK" in the "Spam" column. I'll take that data and apply it to the wiki.
Please let me know if we can do this any simpler or if there are any problems.
Thanks!
On Mon, Jan 14, 2013 at 9:38 AM, Dimi Paun dimi@lattica.com wrote:
... Please let me know if we can do this any simpler or if there are any problems.
Do you want us to move marked items up to the top of the spreadsheet or will you do that for us?
Erich
On 01/14/2013 11:43 AM, Erich E. Hoover wrote:
On Mon, Jan 14, 2013 at 9:38 AM, Dimi Paun dimi@lattica.com wrote:
... Please let me know if we can do this any simpler or if there are any problems.
Do you want us to move marked items up to the top of the spreadsheet or will you do that for us?
I don't think we need to move marked items to the top of the spreadsheet; that's a lot of work/churn.
What we can do instead is move marked items, from time to time, to a separate sheet if that helps (to keep them out of the way and avoid having to scroll through the sheet too much).
Actually, all these ideas are OK with me -- feel free to suggest/do anything that makes sense and makes it easier to sort through all this stuff; all I really care about is the data :)
On Mon, 14 Jan 2013, Dimi Paun wrote: [...]
https://docs.google.com/a/lattica.com/spreadsheet/ccc?key=0AmY-Kp_Ihu3idFNEO...
I'm not clear on how this is supposed to work. I see a ton of pages containing 'joyal' or 'crusher' in their Page Name, for instance:
Jaw crusher from joyal jc001j
However, a quick search for 'crusher' shows '0 results out of about 21011 pages' (for both a Title and a Full Text search). Either I'm searching wrong or these pages have already been deleted. If the latter, then why do we have to manually mark them as Spam again?
MoinMoin creates a dir for every page. I simply got the list by listing these directories. (This is the problem -- there is a limit of 2^15 subdirectories, and this is what we were hitting a few days ago).
Does that answer the question?
Dimi.
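For reference, a minimal sketch of that directory listing, assuming the usual MoinMoin layout where every page (including deleted and spam ones) is a subdirectory of data/pages; the path below is hypothetical:

    import os

    PAGES_DIR = "/path/to/wiki/data/pages"  # hypothetical path to MoinMoin's page store

    # Every page -- including deleted and spam ones -- is one subdirectory here.
    page_dirs = [d for d in sorted(os.listdir(PAGES_DIR))
                 if os.path.isdir(os.path.join(PAGES_DIR, d))]

    print("%d page directories (the limit mentioned above is 2^15 = 32768)" % len(page_dirs))
    for name in page_dirs:
        print(name)

The same listing, dumped as CSV, is presumably what ended up in the spreadsheet.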
On 01/14/2013 12:51 PM, Francois Gouget wrote:
[...]
It feels like your methodology is flawed. Let's take the "Jaw crusher from joyal jc001j" page as an example.
As far as I can tell that page has already been deleted. MoinMoin knows that. So you should not need us humans to waste time going through 20000+ rows of the spreadsheet to tell you that the directory corresponds to a deleted page.
So is your problem that you want to preserve non-spam deleted pages?
Can't a script go through these directories, notice that the page has been deleted, that the delete comment contains the word 'spam' and then delete the directory?
OK, that's a fair point. Lemme quickly go through that and I'll report back.
Dimi.
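For reference, a rough sketch of the scan Francois describes above. It assumes MoinMoin 1.x's per-page edit-log files are tab-separated with the action in the third field and the comment in the last, and that deletions show up with a DELETE-style action string; the path, field order, and action string are all assumptions, and the sketch only reports candidates rather than deleting anything:

    import os

    PAGES_DIR = "/path/to/wiki/data/pages"  # hypothetical path

    def last_log_entry(page_dir):
        """Return the fields of the last edit-log line for a page, or None."""
        log = os.path.join(page_dir, "edit-log")
        if not os.path.isfile(log):
            return None
        with open(log, errors="replace") as f:
            lines = [line for line in f.read().splitlines() if line.strip()]
        return lines[-1].split("\t") if lines else None

    for name in sorted(os.listdir(PAGES_DIR)):
        page_dir = os.path.join(PAGES_DIR, name)
        if not os.path.isdir(page_dir):
            continue
        fields = last_log_entry(page_dir)
        if not fields or len(fields) < 3:
            continue
        action = fields[2]            # assumed field order: time, revision, action, ...
        comment = fields[-1].lower()  # the comment is assumed to be the last field
        if "DELETE" in action.upper() and "spam" in comment:
            print("deleted as spam:", name)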
On 01/14/2013 01:35 PM, Francois Gouget wrote:
[...]
Can't a script go through these directories, notice that the page has been deleted, that the delete comment contains the word 'spam' and then delete the directory?
Hm, it doesn't seem to be so simple. Each page maintains an edit-log file with all of its changes.
grep-ing the edit-logs for 'spam' (case-insensitively) yields fewer than 400 hits.
Maybe we should look for deleted pages instead?
Dimi.
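A minimal sketch of that deleted-page check, under the assumption that in MoinMoin 1.x each page directory has a "current" file naming the latest revision and a "revisions/" subdirectory holding the revision texts, so a deleted page's current revision has no file there. That layout detail is an assumption, so it is worth verifying on a copy of the data first:

    import os

    PAGES_DIR = "/path/to/wiki/data/pages"  # hypothetical path

    def is_deleted(page_dir):
        """True if the page's current revision has no content file (assumed layout)."""
        current = os.path.join(page_dir, "current")
        if not os.path.isfile(current):
            return True  # no current pointer at all; treat as deleted/empty
        with open(current) as f:
            rev = f.read().strip()
        return not os.path.isfile(os.path.join(page_dir, "revisions", rev))

    deleted = [name for name in sorted(os.listdir(PAGES_DIR))
               if os.path.isdir(os.path.join(PAGES_DIR, name))
               and is_deleted(os.path.join(PAGES_DIR, name))]
    print("%d pages look deleted" % len(deleted))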
On 14.01.2013 20:00, Dimi Paun wrote: [...]
Simple idea: make a backup of the current state, just in case, and yes, remove all the deleted pages. If some of them were not spam, then there was some other good reason they were deleted, and if someone raises a hand later you can still look into the backup. This should also speed up that old wiki and maybe help with upgrading it (hopefully that'll happen soon :D).
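A rough sketch of that backup-then-remove step. The paths and the example entry are hypothetical, the flagged list would come from a scan like the ones sketched earlier, and the directories are moved aside rather than deleted so they can be restored either from quarantine or from the tarball:

    import os
    import shutil
    import tarfile
    import time

    PAGES_DIR = "/path/to/wiki/data/pages"              # hypothetical paths
    QUARANTINE = "/path/to/wiki-cleanup/removed-pages"  # kept outside the data dir

    # 1. Full backup of the page store, just in case someone raises a hand later.
    backup = "pages-backup-%s.tar.gz" % time.strftime("%Y%m%d")
    with tarfile.open(backup, "w:gz") as tar:
        tar.add(PAGES_DIR, arcname="pages")

    # 2. Move (rather than delete) the directories of pages flagged for removal.
    flagged = ["SomeSpamPage"]  # hypothetical entry; normally a long generated list
    os.makedirs(QUARANTINE, exist_ok=True)
    for name in flagged:
        src = os.path.join(PAGES_DIR, name)
        if os.path.isdir(src):
            shutil.move(src, os.path.join(QUARANTINE, name))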
OK, we might be onto something. I've written a script to find the deleted pages: there are 20162 of them.
Should I just go ahead and nuke those?
Dimi.
On Mon, Jan 14, 2013 at 03:32:40PM -0500, Dimi Paun wrote:
OK, we might be onto something. I've written a script to find the deleted pages: there are 20162 of them.
Should I just go ahead and nuke those?
Probably, yes.
One common way for spammers to abuse wikis is to intentionally get the pages deleted, so that they live on permanently in the revision history.
See this example from today: the page was rightfully deleted as spam (http://wiki.winehq.org/SheriOci), but the "Get Info" link still leads you to the old revision, with the linkspam intact (http://wiki.winehq.org/SheriOci?action=recall&rev=1).
I don't know if there's a way to keep MoinMoin from preserving revision history on deleted pages, but it might be sufficient to simply disable that. I doubt there's much useful history on deleted pages anyway.
Andrew
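If simply disabling that behaviour isn't possible, a blunt file-level sketch of Andrew's idea would be to drop the stored revisions of pages that are currently deleted, so old spam revisions can no longer be recalled. The deleted-page check is the same assumption about the on-disk layout as in the earlier sketch, and a backup beforehand is a must:

    import os

    PAGES_DIR = "/path/to/wiki/data/pages"  # hypothetical path

    def is_deleted(page_dir):
        # Same check as in the earlier sketch: the current revision has no file.
        current = os.path.join(page_dir, "current")
        if not os.path.isfile(current):
            return True
        with open(current) as f:
            rev = f.read().strip()
        return not os.path.isfile(os.path.join(page_dir, "revisions", rev))

    for name in sorted(os.listdir(PAGES_DIR)):
        page_dir = os.path.join(PAGES_DIR, name)
        if not os.path.isdir(page_dir) or not is_deleted(page_dir):
            continue
        revdir = os.path.join(page_dir, "revisions")
        if not os.path.isdir(revdir):
            continue
        for rev in os.listdir(revdir):
            os.remove(os.path.join(revdir, rev))  # old revision text; no longer recallable
        print("purged revisions of deleted page:", name)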
On 14.01.2013 21:40, Andrew Eikum wrote: [...]
Seems done, and you know what? The wiki seems to be more than 10 times faster (I tried e.g. editing -> preview). Good choice! Thanks Dimi for working on this!!!
Hi guys,
I've cleaned up the deleted pages; we're down to about 740 pages, mostly good stuff:
https://docs.google.com/a/lattica.com/spreadsheet/ccc?key=0AmY-Kp_Ihu3idFNEO...
Please check it out, lemme know if any spam is still left standing.
Any ideas on how we can attack the spam users?
Dimi.
On 13-01-14 7:16 PM, André Hentschel wrote:
[...]
Hi Dimi,
Dimi Paun dimi@lattica.com wrote:
I've cleaned up the deleted pages; we're down to about 740 pages, mostly good stuff:
https://docs.google.com/a/lattica.com/spreadsheet/ccc?key=0AmY-Kp_Ihu3idFNEO...
Please check it out, lemme know if any spam is still left standing.
At least the following pages don't exist (and seem to be deleted as spam):
AdamSpang AntoineX56 BoyceUai CharlesYJI CyrilBann
I didn't check further entries; maybe your script can do that?
On 13-01-14 11:11 PM, Dmitry Timoshkov wrote:
[...]
Thanks -- some of these were skipped by the first pass of the script. I ran it again and we're down to 729. Eyeballing the result, it looks pretty clean, so I'd say this is good enough for now as far as pages are concerned.
Thanks everyone for your help!
I'll take down the Pages spreadsheet.
Now, what about the users? Those are files (not directories), so we don't face the same low limit (32k), but it would be nice if we could somehow clean up those files as well.
On Tue, Jan 15, 2013 at 1:06 PM, Dimi Paun wrote:
Thanks everyone for your help!
I'll take down the Pages spreadsheet.
Now, what about the users? Those are files (not directories), so we don't face the same low limit (32k), but it would be nice if we could somehow clean up those files as well.
If I'm remembering right, a full install of MoinMoin (not just running the service portably in the unpacked tree) puts a moin command into /usr/bin. The documentation for it isn't great yet, but you can find it at http://master.moinmo.in/HelpOnMoinCommand.
Unfortunately, it doesn't have a mechanism for cleaning out users beyond obvious duplicate accounts. One possibility I was looking at is that v1.5 of MoinMoin updates ".trail" files for all logged-in users, even if the page trail display has been disabled.
The idea was to scan the user directory for all .trail files with a mod-time older than a certain time period (I picked 1 year). If a user has logged in to do anything more recently than that, it should show up in the mod-time of the .trail file.
I wanted to test my scripts a little more, but this was one thing my sweep-once script at https://bitbucket.org/kauble/moin-admin was designed to do. Other than blanking out and putting "file.new" instead of "file.tmp" in line 96 of split-logs.pl, the logic seemed sound on small test batches. I wanted to try it on a full copy of the Wine Wiki just to be safe though.
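A minimal sketch of that .trail scan. The path is hypothetical, it assumes the MoinMoin 1.5 user directory holds a <userid>.trail file next to each profile file, and it only flags candidates rather than removing anything:

    import os
    import time

    USER_DIR = "/path/to/wiki/data/user"    # hypothetical path
    cutoff = time.time() - 365 * 24 * 3600  # roughly one year ago

    stale = []
    for name in os.listdir(USER_DIR):
        if not name.endswith(".trail"):
            continue
        if os.path.getmtime(os.path.join(USER_DIR, name)) < cutoff:
            stale.append(name[:-len(".trail")])  # the user id the trail belongs to

    print("%d users with no recorded activity in the last year" % len(stale))
    for user_id in sorted(stale):
        print(user_id)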
On Tue, Jan 15, 2013 at 4:40 AM, André Hentschel wrote:
This should also speed up that old wiki and maybe help with upgrading it (hopefully that'll happen soon :D).
I haven't touched a line of code in a couple of months (I had a holiday job that really knocked the wind from my sails at times), but after getting settled into my classes over the next few days, I plan on working on moving the wiki to v1.9 of MoinMoin again.
The one thing that would probably help a lot would be a regularly updated tarball of the wiki content, either at WineHQ or on Lattica's FTP again. I haven't messed with cron itself much, but my archive.cron script should pack up the files correctly. The main complication is that the user dir should probably only be shared on a need-to-know basis, because it contains weakly-hashed password info.
Kyle
Hi folks,
Thanks for all the help and hints -- much appreciated.
I ended up writing a few scripts myself that cleaned up both the pages and the users. That should do for now.
Please let me know if you see any problems with the wiki; I hope I wasn't over-eager when cleaning up the spam :)))
Cheers, Dimi.
On 01/15/2013 11:10 AM, Kyle Auble wrote:
[...]
Hi Kyle,
On Tue, Jan 15, 2013 at 8:10 AM, Kyle Auble randomidman48@yahoo.com wrote:
The one thing that would probably help a lot would be a regularly updated tarball of the wiki content, either at WineHQ or on Lattica's FTP again. I haven't messed with cron itself much, but my archive.cron script should pack up the files correctly. The main complication is that the user dir should probably only be shared on a need-to-know basis, because it contains weakly-hashed password info.
Could the password hashes be excluded from the regular tarball? E.g. using --exclude in the tar command? --Juan
On Wed, Jan 16, 2013 at 12:19 AM, Juan Lang wrote:
Could the password hashes be excluded from the regular tarball? E.g. using --exclude in the tar command?
Sorry I didn't reply sooner; I've been a little busy the past week. I don't have a copy of the Wine Wiki data in front of me, but if I remember correctly, the passwords aren't stored separately at the file level. Each user has a data file (and, at least for v1.5, a .trail and possibly a .bookmark file).
The password is stored as a single record in that file. I'm no security expert, but my gut feeling is that separating the password data by default might be a good change upstream. Short of that though, I fiddled with reading off each password, running it through bcrypt, then putting it back into place before packing up the files.
It probably wouldn't be too hard to sift out the passwords before archiving the user directory. Ultimately, it seemed just keeping the user directory out of the open was the best bet though. My logic was that while there are several reasons someone might want to "fork" or independently archive the content (which is LGPL), I couldn't think of a legitimate reason someone would want everyone's account info without cooperating with the current maintainers.
-Kyle
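For what it's worth, a sketch along those lines: either leave the user dir out of the tarball entirely (Juan's --exclude idea), or stage sanitized copies first, as below. It assumes MoinMoin user profiles are plain key=value files with the hash on an "enc_password=..." line; that key name and the paths are assumptions worth checking against a real profile before relying on this:

    import os
    import shutil
    import tarfile

    USER_DIR = "/path/to/wiki/data/user"   # hypothetical paths
    STAGING = "/tmp/wiki-user-export"
    ARCHIVE = "/tmp/wiki-user-export.tar.gz"

    os.makedirs(STAGING, exist_ok=True)
    for name in os.listdir(USER_DIR):
        src = os.path.join(USER_DIR, name)
        dst = os.path.join(STAGING, name)
        if not os.path.isfile(src):
            continue
        if name.endswith((".trail", ".bookmark")):
            shutil.copy2(src, dst)  # no secrets expected in these
            continue
        # Copy the profile, dropping the password line (assumed key name).
        with open(src, errors="replace") as fin, open(dst, "w") as fout:
            for line in fin:
                if not line.startswith("enc_password="):
                    fout.write(line)

    with tarfile.open(ARCHIVE, "w:gz") as tar:
        tar.add(STAGING, arcname="user")
    print("wrote", ARCHIVE)

Staging sanitized copies keeps the rest of the account data intact for anyone who legitimately needs it, while the weakly-hashed passwords never leave the server.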