Google Sites - HTML Tag Cleaner

From NoskeWiki

About

NOTE: This page is a daughter page of: Google Sites, and is related to: HTML and JavaScript


This is a simple "Google Sites HTML Cleaner" I created: you paste the messy HTML generated by Google Sites into the middle, and it removes a bunch of tags to produce a cleaner version on the right (for you to paste back into the Google Sites HTML editor). To try it out go here:

... otherwise read on and I'll show you the source code which you can modify to make your own.


Google Sites HTML Cleaner GUI in action: paste into the middle, cleaned-up version appears on the right


Background and Instructions

A huge complaint I have about Google Sites is the mess it makes of the HTML code as you create pages. To see this, just click the edit button on any Google Site, then the "HTML" button. Notice that it adds a tonne of unnecessary div tags! If you are unlucky, it will also add a bunch of other unnecessary tags and attributes, including font size, color, align, style, span, code and so on. If you are lucky you will rarely notice this messiness in the default (WYSIWYG) editor, but sometimes the page will do unexpected things with the formatting of fonts, bullet points and so on, and that's when it has messed things up really badly. This can make your page unmanageable and is a particularly large problem if you want to copy the HTML to another page or Google Site.


Google Sites example page (about to click "HTML" button)
Same Google Sites page showing messy HTML


To help remove these tags I created a small HTML page which lets you paste the HTML into the big middle text area and have a cleaner version appear on the right. To keep the code simple I do not parse the HTML as a tree, but simply use JavaScript's regex and string replace functions to remove a bunch of tags you don't need. The lists of exact string matches and regex matches are on the far left, and you can add more if you see extra stuff you don't need: just copy and paste in the lines you need. You can even copy the "after" text into the "before" box to effectively save your progress.
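
As a rough illustration of the idea described above, here is a minimal standalone sketch of removing tags by regex and by exact match. The names `removeAll` and `cleanHtml` and the sample input are made up for this example; the full source further down reads its patterns from textareas instead.

```javascript
// Remove every exact occurrence of `needle` from `text` (no regex involved).
function removeAll(text, needle) {
  return text.split(needle).join("");
}

// Hypothetical helper: applies the regex removals first, then the exact
// matches, then deletes the empty "<>" shells the regex removals leave behind.
function cleanHtml(html, regexPatterns, exactMatches) {
  let text = html;
  for (const pattern of regexPatterns) {
    text = text.replace(new RegExp(pattern, "g"), "");
  }
  for (const match of exactMatches) {
    text = removeAll(text, match);
  }
  return removeAll(text, "<>");
}

// Made-up sample of the kind of nesting Google Sites tends to produce.
const messy =
  '<div><font size="2"><span style="color:red"><b>Hello</b></span></font></div>';

const cleaned = cleanHtml(
  messy,
  ['font size="\\d+"', 'span style="[^"]+"'],
  ["<div>", "</div>", "</span>", "</font>"]
);

console.log(cleaned);  // <b>Hello</b>
```

Note that the regex patterns deliberately leave out the angle brackets: removing `font size="2"` from `<font size="2">` leaves `<>`, which the last step then deletes.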


By keeping the JavaScript and HTML code as simple/general as possible, others should be able to copy this same code and perform other nice find/replace operations.


Next steps:
One of my goals is to make a more sophisticated version which uses JavaScript syntax highlighting and a tree of tags, so it can do smarter tag removal and show you when you have errors with unclosed tags. If I make this I might make it internal to Google and hopefully convince some people that this option should be made part of the Google Sites code base so anyone can run "cleanup HTML" without all the messing around with copy-paste and replacements.
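
To give a feel for what the unclosed-tag check might look like, here is a minimal sketch. It is an assumption about one possible approach, not the planned implementation: it uses a simple stack of open tag names rather than a full tree, and the names `VOID_TAGS` and `findTagErrors` are hypothetical. It will be fooled by `>` inside attribute values, by comments and by script content.

```javascript
// Void elements never take a closing tag, so they are never pushed on the stack.
const VOID_TAGS = new Set(["br", "hr", "img", "input", "meta", "link"]);

// Hypothetical helper: returns a list of human-readable tag problems.
function findTagErrors(html) {
  const errors = [];
  const stack = [];  // Names of tags that are currently open.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>/g;
  let m;
  while ((m = tagPattern.exec(html)) !== null) {
    const isClosing = m[1] === "/";
    const name = m[2].toLowerCase();
    const isSelfClosing = m[3] === "/";
    if (VOID_TAGS.has(name) || isSelfClosing) {
      continue;
    }
    if (!isClosing) {
      stack.push(name);
    } else if (stack.length > 0 && stack[stack.length - 1] === name) {
      stack.pop();
    } else {
      errors.push("unexpected </" + name + ">");
    }
  }
  // Anything still on the stack was opened but never closed.
  while (stack.length > 0) {
    errors.push("unclosed <" + stack.pop() + ">");
  }
  return errors;
}

console.log(findTagErrors("<div><b>hi</b>"));  // [ 'unclosed <div>' ]
console.log(findTagErrors("<p>ok</p><br>"));   // []
```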

If you know of a similar tool or get a chance to do any of this before me, please email! andrew.noskeATSIGNgmail.com



Google Sites HTML Cleaner - Source Code

Instructions: create the following three files, as named, in the same directory, then open the HTML file in a web browser.

google_site_html_cleaner.html

<!DOCTYPE html>
<html>
<head>
<title>Google Sites HTML Cleaner</title>
<link rel="stylesheet" type="text/css" href="google_site_html_cleaner.css" />
<script src="google_site_html_cleaner.js"></script>
</head>
<body>
 
<div id="header_options">
<b>Options</b>
</div>
 
<div id="options_area">
 
Remove exact matches:
<div id="medium_text_area">
<textarea id="txt-remove-match" onchange="updateCode()">
<div>
</div>
<span>
</span>
</font>
</textarea>
</div>
 
<br>
Remove with regex:
<div id="small_text_area">
<textarea id="txt-remove-regex" onchange="updateCode()">
font size="\d+"
font color="#\S\S\S\S\S\S"
span style="[^"]+"
div style="[^"]+"
</textarea>
</div>
 
<br>
<input type="checkbox" id="chk-clean-tags" name="chk-clean-tags" onchange="updateCode()" checked>Remove empty tags<br>
<input type="checkbox" id="chk-remove-space" name="chk-remove-space" onchange="updateCode()" checked>Remove bad spaces<br>
<input type="button" onclick="updateCode()" value="Update!">
<br>
 
<hr>
 
<br>
Ignore:
<br><i>(copy in/out stuff to ignore)</i>
<div id="small_text_area">
<textarea id="txt-ignore" rows=20>
font color="#\S\S\S\S\S\S"
style="[^"]+"
</textarea>
</div>
 
</div>
 
 
<div id="code_area">
 
<div id="header_before">
<b>Before</b> (paste code here)
</div>
 
<div id="code_area_before">
<textarea id="txt-code-before" onchange="updateCode()">
Paste code in here...
</textarea>
</div>
 
 
<div id="header_after">
<b>After</b> (copy from here)
</div>
 
<div id="code_area_after">
<textarea id="txt-code-after" readonly>
...
</textarea>
</div>
 
</div>
 
</body>
</html>


google_site_html_cleaner.js

/**
 * Inputs a long 'haystack_str' string and does a find and replace
 * all 'from_str' to 'to_str' with no regex.
 * @param {!string} haystack_str The text to 'search'.
 * @param {!string} from_str The text to 'find'.
 * @param {!string} to_str The text to 'replace with'.
 * @return {string}  Returns the 'haystack' string after substitutions.
 */
function replaceAll(haystack_str, from_str, to_str) {
  return haystack_str.split(from_str).join(to_str);
  // NOTE: I haven't used "string.replace" because with a string pattern it
  // only replaces the first occurrence, and the regex version would treat
  // special characters in 'from_str' as regex syntax.
}
 
/**
 * Gets the values from a series of textarea boxes and checkboxes
 * and performs a series of replace operations. The final result:
 * the textarea with id 'txt-code-after' is filled with a modified
 * version of the string from 'txt-code-before'.
 */
function updateCode() {
  // Get all boolean values from checkboxes:
  var cleanTags    = document.getElementById("chk-clean-tags").checked;
  var removeSpaces = document.getElementById("chk-remove-space").checked;
 
  // Get all string values from textareas:
  var beforeText = document.getElementById("txt-code-before").value;
  var regexText = document.getElementById("txt-remove-regex").value;
  var matchText = document.getElementById("txt-remove-match").value;
  var text = beforeText;  // Will become new HTML string.
 
  // Process array of regex values to remove:
  var regexLine = regexText.split("\n");
  for (var i = 0; i < regexLine.length; i++) {
    if (regexLine[i].length < 2)
      continue;
    var regex = new RegExp(regexLine[i], "g");  // "g" replaces all occurrences.
    text = text.replace(regex, "");
  }
 
  // Process array of exact match values to remove:
  var matchLine = matchText.split("\n");
  for (var i = 0; i < matchLine.length; i++) {
    if (matchLine[i].length == 0)
      continue;
    text = replaceAll(text, matchLine[i], "");
  }
 
  // Clean up spaces:
  if (removeSpaces) {
    text = replaceAll(text, "&nbsp;", " ");
    text = replaceAll(text, "  ", " ");
    text = replaceAll(text, "  ", " ");
  }
 
  // Clean up tags and remove empty ones (if desired): 
  if (cleanTags) {
    text = replaceAll(text, "</ ", "</");
    text = replaceAll(text, "< ", "<");
    text = replaceAll(text, " >", ">");
    text = replaceAll(text, "<>", "");
  }
 
  document.getElementById("txt-code-after").value = text;
}
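
Two spots in `updateCode` are fragile, and the sketch below shows one possible way to harden them. First, running the double-space replacement twice only fully collapses runs of up to four spaces (a run of five ends up as two), whereas a single regex handles any length. Second, `new RegExp` throws on an invalid pattern, which would abort the whole update. The names `collapseSpaces` and `compilePatterns` are hypothetical and not part of the original code; note also that this version skips blank lines rather than lines shorter than two characters.

```javascript
// Collapse any run of two or more spaces into one space, in a single pass.
function collapseSpaces(text) {
  return text.replace(/ {2,}/g, " ");
}

// Hypothetical helper: compile each non-blank line into a global RegExp,
// collecting lines that fail to compile instead of throwing.
function compilePatterns(lines) {
  const compiled = [];
  const invalid = [];
  for (const line of lines) {
    if (line.trim().length === 0) {
      continue;  // Skip blank lines.
    }
    try {
      compiled.push(new RegExp(line, "g"));
    } catch (e) {
      invalid.push(line);  // Remember the bad pattern so it can be reported.
    }
  }
  return { compiled: compiled, invalid: invalid };
}

console.log(collapseSpaces("a     b"));  // a b

const result = compilePatterns(['font size="\\d+"', "", "["]);
console.log(result.compiled.length);  // 1
console.log(result.invalid);          // [ '[' ]
```

The `invalid` list could then be shown to the user next to the regex textarea instead of failing silently.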


google_site_html_cleaner.css

#header_options   { position:absolute; top:0px; left:0%; right:50%; height:20px }
#options_area     { position:absolute; top:20px; left:0px; width:290px; bottom:0 }

#small_text_area  { position:relative; width:100%; top:0px; height:200px; }
#medium_text_area { position:relative; width:100%; top:0px; height:400px; }

#code_area        { position:absolute; top:0px; left:300px; right:0; bottom:0 }

#header_before    { position:absolute; top:0px; left:0%; right:50%; height:20px }
#code_area_before { position:absolute; top:20px; left:0; right:50%; bottom:0 }

#header_after     { position:absolute; top:0px; left:50%; right:0; height:20px }
#code_area_after  { position:absolute; top:20px; left:50%; right:0; bottom:0 }
 
textarea { position:absolute; top:0; left:0; width:98%; bottom:0; font-size: 90%; }



Links

  • Google Sites - a fantastic wiki by Google, although - like I said - it does a pretty bad job with the underlying HTML!
  • DirtyMarkup.com - a brilliant tool which formats the HTML you paste in, and shows any errors / unmatched tags.