19

I have a PHP web application. I do NOT want to allow users to post HTML to my site.

If I simply run strip_tags() on all data prior to saving into my database, will strip_tags() be enough to prevent XSS?

I ask because it's unclear to me from reading the strip_tags documentation whether XSS is prevented. There seems to be a bug where browsers accept <0/script> (yes, a zero) as valid HTML.

UPDATE

I realize that I can simply run htmlspecialchars on all outputted data; however, my thought is that, since I don't want to allow HTML in the first place, it's simply easier (and academically better) to clean my data once and for all before saving it in my database than to worry every time I output the data whether it is safe or not.

6 Answers

23

I strongly disagree it's "academically better".

  • It breaks user input (imagine how useless StackOverflow would be for this discussion if they "cleaned" posts from all tags).

  • Text inserted in HTML with only tags stripped will be invalid. HTML requires & to be escaped as well.

  • It's not even safe in HTML! strip_tags() is not enough to protect values in attributes, e.g., <input value="$foo"> might be exploited with $foo = " onfocus="evil() (no <,> needed!)

So the correct solution is to escape data according to the requirements of the language you're generating. When you have plain text and you're generating HTML, you should convert the text to HTML with htmlspecialchars() or similar. When you're generating e-mail, you should convert the text to quoted-printable format, and so on.
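To make the attribute case concrete, here is a minimal sketch that reuses the $foo payload from the list above. The variable names $stripped and $escaped are illustrative only, and the sketch assumes the value lands in a double-quoted attribute on a UTF-8 page.

```php
<?php
// Attacker-supplied value from the example above: no < or > anywhere.
$foo = '" onfocus="evil()';

// strip_tags() has no tag to remove, so the payload passes through unchanged.
$stripped = strip_tags($foo);

// htmlspecialchars() with ENT_QUOTES escapes the quotes that would
// otherwise close the attribute and start a new one.
$escaped = htmlspecialchars($foo, ENT_QUOTES, 'UTF-8');

echo '<input value="' . $stripped . '">' . "\n";
// prints: <input value="" onfocus="evil()">
echo '<input value="' . $escaped . '">' . "\n";
// prints: <input value="&quot; onfocus=&quot;evil()">
```

In the first line of output the browser sees a real onfocus attribute; in the second the whole payload stays inside value.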

1
  • 4
    You're generally right, but user input is not a holy script that cannot be touched, and StackOverflow-like services are an exception rather than a rule. If your application expects a positive integer in parameter ID, then you should drop anything that is not a positive integer.
    – kravietz
    Mar 23, 2015 at 12:53
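A minimal sketch of the positive-integer check this comment describes, assuming the ID arrives as a string (for example from a query parameter). The helper name parsePositiveId is hypothetical; filter_var() with FILTER_VALIDATE_INT and the min_range option is standard PHP.

```php
<?php
// Hypothetical helper: returns the ID as an int, or null when the input
// is anything other than a positive integer.
function parsePositiveId(string $raw): ?int
{
    $id = filter_var($raw, FILTER_VALIDATE_INT, ['options' => ['min_range' => 1]]);

    return $id === false ? null : $id;
}

var_dump(parsePositiveId('42'));        // int(42)
var_dump(parsePositiveId('0'));         // NULL
var_dump(parsePositiveId('<script>'));  // NULL
```

This validates typed input; it does not replace escaping free-form text on output.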
14

strip_tags itself is not going to be sufficient as it removes perfectly valid, non-HTML content. For instance:

<?php
 // Both strings contain a "<" that strip_tags() treats as the start of a tag.
 echo strip_tags("This could be a happy clown *<:) or a puckered face.\n");
 echo strip_tags("Hey guys <--- look at this!\n");

Will output:

This could be a happy clown *

And:

Hey guys

Everything after the initial < gets removed. Very annoying for end users! Disallowing reserved HTML characters would be a bad move. And these characters will need to be escaped with htmlentities or a similar function when used inline with HTML.

You need something more advanced than strip_tags. HTML Purifier works great and will allow users to use HTML reserved characters.
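For comparison, here is a small sketch of what escaping does to the clown example. It uses htmlspecialchars() rather than HTML Purifier, and the variable names are illustrative; the point is only that escaping at output time keeps the user's text intact.

```php
<?php
// Same text as in the strip_tags() example above.
$comment = 'This could be a happy clown *<:) or a puckered face.';

// Escape for an HTML context instead of stripping.
$safe = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
// prints: This could be a happy clown *&lt;:) or a puckered face.
```

The browser renders &lt; as a literal <, so the reader sees the full original sentence.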

6

As others have mentioned, you can use a combination of strip_tags and htmlspecialchars to protect yourself against XSS.

One bad thing about strip_tags is that it might remove harmless content that the user will not expect. I see techies write stuff like: <edit> foo </edit>, where they fully expect those tags to be seen as is. Also, I've seen "normal" people even do things like <g> for "grin." Again, they will think it's a bug if that doesn't show up.

So personally, I avoid strip_tags in preference for my own parser that allows me to explicitly enable certain safe HTML tags, attributes and CSS, explicitly disable unsafe tags and attributes, and convert any other special character to harmless versions. Thus the text is always seen as one would expect.

If I didn't have that parser at my disposal, I would simply use htmlspecialchars to safely encode the text.

4
  • 3
    Your parser might not be safe. HTML adds new elements and attributes. Browsers support non-standard ones. Error recovery in browsers is quite crazy, and might parse things you wouldn't expect. It's pretty hard to keep up. OTOH if you run htmlspecialchars() on all output, every tag-like text will be safely converted, and will always be safe.
    – Kornel
    Dec 9, 2010 at 23:49
  • 2
    @porneL, The parser I'm referring to only allows known tags and attributes from a known white list. Any other non-safe character becomes encoded. Thus it's impossible for unsafe code to be generated.
    – Matthew
    Dec 9, 2010 at 23:56
  • What if I'm sure the data being entered in the site should not contain any HTML-like characters? For example, a form for entering your address? In that case, is strip_tags enough?
    – abeger
    Aug 22, 2011 at 15:38
  • 2
    @abeger, strip_tags could still be problematic because maybe the user typed "<123> Main Street" for some odd reason. It's always going to be dangerous to just silently delete user entered data. In this case, if I wanted to prevent such characters in the field, I probably would not process the form. Instead, I'd display an error to the user saying that the field contained invalid characters. The presence of one of those characters may indicate that they accidentally entered the wrong data into the field.
    – Matthew
    Aug 22, 2011 at 19:46
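The "reject instead of silently deleting" approach from this comment could look roughly like the sketch below. The function name addressError and the error wording are assumptions for illustration; strpbrk() is a standard PHP function that returns false when none of the listed characters occur.

```php
<?php
// Hypothetical validator: returns an error message for the user,
// or null when the address contains no angle brackets.
function addressError(string $address): ?string
{
    if (strpbrk($address, '<>') !== false) {
        return 'The address may not contain "<" or ">".';
    }

    return null;
}

var_dump(addressError('<123> Main Street')); // string with the error message
var_dump(addressError('123 Main Street'));   // NULL
```

The form is redisplayed with the message and the user's original text, so nothing is lost. The stored value should still be escaped when it is output.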
3

It should; I have never heard of that 0 trick before. But you can always run strip_tags and then htmlspecialchars just to be safe. Good practice would be to test this yourself on your application, since you know what type of data can be entered and can check whether it breaks anything. Search for known XSS exploit techniques and use them as test data. I would check at least weekly for new vulnerabilities and keep testing your script against new exploits as they come out.

2
  • 2
    +1 Use htmlspecialchars() anyway to ensure at least nothing gets accidentally parsed as if it were HTML.
    – BoltClock
    Aug 31, 2010 at 2:32
  • 2
    What would be the benefit of using strip_tags if you have already used htmlspecialchars?
    – Mystical
    Feb 12, 2019 at 18:48
3

Need help treating HTML as plain text within the document? Need to echo the value of an attribute without being vulnerable to XSS attacks like <input value="<?php echo '" onkeydown="alert(&quot;XSS&quot;)'; ?>" />?

Use htmlentities().

echo htmlentities('<p>"..."</p>');
// result: &lt;p&gt;&quot;...&quot;&lt;/p&gt;

No strip_tags() required, as this function already replaces < and > with the &lt; and &gt; entities.

What's the difference between htmlentities() and htmlspecialchars() you may ask?

Well, htmlentities() will encode ANY character that has an HTML entity equivalent, while htmlspecialchars() ONLY encodes a small set of the most problematic characters.
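A short sketch of that difference, assuming UTF-8 input. The sample string is made up for illustration; both calls escape the angle brackets, but only htmlentities() also rewrites the accented letter.

```php
<?php
// "café <b>" written with an explicit code point so the example does not
// depend on the encoding of this source file.
$text = "caf\u{E9} <b>";

$special  = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // prints: café &lt;b&gt;
echo $entities, "\n"; // prints: caf&eacute; &lt;b&gt;
```

For XSS purposes the two are equivalent here, since the characters that matter (<, >, &, and quotes) are escaped by both.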

1

strip_tags() can help, but it's not bulletproof. Since it doesn't validate the HTML it's stripping, some clever person WILL find an HTML construct (mangled or otherwise) that gets stripped but still results in something nasty getting through. But for now, it should handle most everything that gets thrown at it. Just don't assume that this will be true forever.

Also, if you allow any tags to pass through via the 'allowable tags' parameter, those tags keep all of their attributes, including JavaScript event handlers such as onclick.
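A minimal sketch of that last point, with a made-up input string: allowing <b> through strip_tags() also lets its onclick attribute through, because the function does not inspect attributes.

```php
<?php
// Illustrative input: an allowed tag carrying an event handler,
// followed by a tag that is not on the allow list.
$input = '<b onclick="evil()">bold</b><script>alert(1)</script>';

// Second argument is the list of tags to keep.
$result = strip_tags($input, '<b>');

echo $result, "\n";
// The <script> tags are removed (their text content remains),
// but the <b> tag is kept together with its onclick attribute.
```

Filtering attributes needs an HTML-aware sanitizer or an explicit allow list of attributes; strip_tags() alone cannot do it.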
