This is a sample wiki engine, written in Python. It's meant as a learning aid, not a real tool: it lacks most features, can serve only one user at a time, and stores all the page contents in memory – so they are gone when you restart it.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import BaseHTTPServer, urllib, re

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    template = u"""<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd"><html><head><title>%s</title>
</head><body><h1>%s</h1><pre>%s</pre><form action="" method="POST"
class="editor"><div><textarea name="text">%s</textarea><input type="submit"
value="Save"></div></form></body></html>"""

    def escape_html(self, text):
        """Replace special HTML characters with HTML entities"""
        return text.replace(
            "&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def link_repl(self, match):
        """Return HTML for link"""
        title = match.group(1)
        if title in self.server.pages:
            return u"""<a href="%s">%s</a>""" % (title, title)
        return u"""%s<a href="%s">?</a>""" % (title, title)

    def do_HEAD(self):
        """Send response headers"""
        self.send_response(200)
        self.send_header("content-type", "text/html;charset=utf-8")
        self.end_headers()

    def do_GET(self):
        """Send page text"""
        self.do_HEAD()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.escape_html(self.server.pages.get(page, "Empty..."))
        parsed = re.sub(r"\[\[([^]]+)\]\]", self.link_repl, text)
        self.wfile.write(self.template % (page, page, parsed, text))

    def do_POST(self):
        """Save new page text and display it"""
        length = int(self.headers.getheader('content-length'))
        if length:
            text = self.rfile.read(length)
            page = self.escape_html(urllib.unquote(self.path.strip('/')))
            # Skip the leading "text=" and decode the form encoding
            self.server.pages[page] = urllib.unquote_plus(text[5:])
        self.do_GET()

if __name__ == '__main__':
    server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
    server.pages = {}
    server.serve_forever()
To try this wiki, just run it with a Python interpreter on your computer, and point your web browser to http://127.0.0.1:8080.
This engine uses the built-in web server from Python's standard library, BaseHTTPServer, so that you don't need to set up your own web server or look for hosting services just to play with it. We provide this server with a custom request handler that supports three kinds of requests:
- HEAD request is done by your web browser to check if a page exists, if it changed recently, how large it is, etc. In this code it is handled by the do_HEAD method, and we do only the absolute minimum required by the HTTP protocol:
  - we send the response code 200 (that means "everything is alright"),
  - we say what kind of data the page contains (in this case, an HTML web page with Unicode characters),
  - and we send a marker for the end of headers (just an empty line).
- GET request is performed when your browser needs to download the page contents, so that it can display it.
  - We need to send the headers, just like in the previous case, but we also need to send the page contents.
  - First, we determine the name of the page.
    - The attribute self.path contains the part of the URL without the server name.
    - Because it was inside a URL, all characters not normally allowed in there had to be encoded in the form of %XX, where XX is the hexadecimal code of the character. We use the function urllib.unquote to convert it back into normal characters.
    - Now, because our page title is going to be displayed inside HTML, it can't contain characters like <, > or &, because they could be confused for part of the markup. That's why we use the function escape_html to convert these three characters into a form that is allowed inside HTML: &lt;, &gt; and &amp;.
  - Next step is to retrieve the actual text of the specified page – we try to take it from our server.pages dictionary, and fall back to "Empty…" if no such page was created yet.
  - Since the page's text is going to be displayed as a part of HTML, we have to escape it too.
  - We must find all the links in the page text, and convert them into the actual HTML markup for the links.
    - The function re.sub searches for all occurrences of the specified pattern, and replaces them with the results of calling the link_repl function.
    - The function link_repl takes the contents of the first parenthesis in our link pattern (that is, everything that was between [[ and ]]), and checks if a page with this name already exists. Then it returns either a link to an existing page or a link to a non-existing page (one with a question mark).
  - Finally we just put it all together, using our template, and send it to the browser.
- POST requests are made when you fill in a form and send it to the server. In our case this means that you edited the page and clicked "Save". We need to update the page's contents and then display the page normally.
  - First, we check how long the form data is, so that we can read it all. If the length is zero, we just skip the whole saving step.
  - We read the data that was sent from the browser.
  - We determine the page's title, just like before.
  - Now we have to decode the data. It's encoded just like the path in the URL, only in addition all the spaces are replaced with + characters. Also, because the text field in the form where the page's text was is called "text", the data will contain text= at the beginning – that's why we skip the first 5 characters.
  - We store the new page text in the dictionary and serve the page normally, like we did with GET.
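The link step of the GET handling described above can be tried in isolation. This is a minimal sketch, assuming a standalone function render_links (a hypothetical name, not part of the engine) that receives the pages dictionary as an argument instead of reading self.server.pages; the pattern and the two link forms are taken from the listing.

```python
import re

# Same pattern as in the listing: [[ then anything except ] then ]]
LINK = re.compile(r"\[\[([^]]+)\]\]")

def render_links(text, pages):
    """Replace [[Title]] with HTML links (hypothetical helper).

    Existing pages get a normal link; missing pages get the plain
    title followed by a "?" link, as link_repl does in the engine.
    """
    def link_repl(match):
        title = match.group(1)
        if title in pages:
            return '<a href="%s">%s</a>' % (title, title)
        return '%s<a href="%s">?</a>' % (title, title)
    return LINK.sub(link_repl, text)

# The text is assumed to be HTML-escaped already, as in do_GET.
print(render_links("See [[Home]] and [[Todo]].", {"Home": "Welcome"}))
```

Passing the dictionary in explicitly is only a convenience for experimenting outside the request handler; the engine itself keeps the method on the handler.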
There is a lot of space for improvement:
- We can store the last modification time of the pages, and send it in the headers – then the browser will know if that page changed since it was last downloaded. If the page didn't change, the browser won't download it again but show the cached version instead.
- We can send the page size in the headers, so that the browser can display a progress bar when downloading.
- We can send varying response codes depending on what actually happens. For example, if the page doesn't exist yet, we can send code 404 (Not found). When the page is first created, we can send 201 (Created), etc. This kind of additional information can be very helpful, for example web indexing spiders will know to not save all the "Empty page…" pages.
- If you save the page, then follow some link and hit "back", your browser will attempt to save the page again (and sometimes display an annoying message). It does that because it remembers not only the URL of the page, but also the form data that was sent with it. We can easily prevent this: instead of sending the page's contents after it was saved, we can respond with a redirect: a code of 303 (See Other), and a new address to redirect to – in this case the same URL, but without the POST data. Then the web browser will only remember the new page, without the form data.
- You can remember all the changes made, together with dates and the address from which they came, and display them on a special "recent changes" page.
- You can add more markup rules than just the simple links. Then, if parsing the page text when you send it takes too long, you can also store the "parsed" versions, and only parse when something was changed or when the user requests a refresh (you can recognize it by checking the request headers).
- You can move the editor form to a separate page. In order to do that, you need a way of recognizing whether the user wants the normal page or the editor: you can do that by adding something to the page name, like "edit".
- Once you have one such special action, you can add more, for example searching.
- You don't have to just replace the text of saved page: you can keep the old versions of pages too. Then you can display page history and differences between versions.
- This engine can only serve one page to one user at a time: if more people use the wiki, they have to wait in a queue. This is not a big problem, but you can improve the engine to serve several pages at a time. Remember to add page locking, so that when two people hit "Save" at the same time, the page text is not destroyed!
- Obviously it would be nice to have the pages actually saved somewhere, either into files or a database.
- And lots more…
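Two of the ideas above, keeping old versions and guarding against simultaneous saves, can be sketched together. This is only an illustration under stated assumptions: PageStore is a hypothetical class that is not part of the engine, it still keeps everything in memory, and it assumes the server has been made multi-threaded in some way (for example with the standard library's ThreadingMixIn).

```python
import threading
import time

class PageStore(object):
    """In-memory pages that keep every saved revision (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._revisions = {}  # page name -> list of (timestamp, text)

    def save(self, name, text):
        # The lock keeps two simultaneous saves from corrupting the list.
        with self._lock:
            self._revisions.setdefault(name, []).append((time.time(), text))

    def get(self, name, default="Empty..."):
        # Return the newest revision, or the default for unknown pages.
        with self._lock:
            history = self._revisions.get(name)
            return history[-1][1] if history else default

    def history(self, name):
        # Return a copy so callers cannot modify the stored revisions.
        with self._lock:
            return list(self._revisions.get(name, []))

store = PageStore()
store.save("Home", "first draft")
store.save("Home", "second draft")
print(store.get("Home"))
print(len(store.history("Home")))
```

Note that the lock only protects the data structure. It does not detect that two people edited the same old version of a page; that would need an extra check, such as comparing revision numbers on save.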
I'm also writing down a detailed process in which I came up with this code (minus obvious errors and some frustration with empty POSTs) at Step By Step Wiki Engine.
Ok, so you've seen that 50-line wiki, but you would like to know how I actually wrote it? It's not any special feat. Actually, writing exceptionally small programs, although it takes much more time, seems to me to be easier than writing elaborate code for doing the same thing, mostly because there is less room for bugs. Anyway, I thought it could be beneficial to show how you actually do it, not just the end result. So here goes.
This wiki engine was intended to be used as a learning tool, and I wanted it to work out of the box anywhere possible (in this case, wherever Python is available). Because getting a hosting service with Python is not trivial, and setting up your own web server may be too complicated on various operating systems, I decided that the engine must contain its own web server. I knew there was a simple web server implementation in the Python standard library, but I didn't know how to use it. So, naturally, the first step was a simple test server:
import BaseHTTPServer
handler = BaseHTTPServer.BaseHTTPRequestHandler
server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), handler)
server.serve_forever()
This code is the basic HTTP server that just runs there on port 8080 of our local host (127.0.0.1 always points to the local computer, it's sometimes called a loopback address) and responds with an error to every request:
Error response
Error code 501.
Message: Unsupported method ('GET').
Error code explanation: 501 = Server does not support this operation.
You can terminate it by pressing ctrl+c twice in the console where you run it. I've chosen port 8080, not the default one, 80, because there may already be a server running on that port, and on most systems you would need administrator privileges to use it. I tell it to run only on the loopback interface, and not on all interfaces, for security reasons – I don't want anyone from the outside connecting to my experimental program. I can replace the "127.0.0.1" with just an empty string "" to make it respond on all interfaces later.
The reason why it responds with an error is obvious: it doesn't know how to do anything else. The handler we used is a blank slate and doesn't do anything useful yet. To make it do something, we need to extend it – and we can do that by making our own handler that inherits everything from the BaseHTTPRequestHandler, but in addition defines code to handle the GET and other methods. So, the next step is a simple "hello world":
import BaseHTTPServer

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/plain")
        self.end_headers()
        self.wfile.write("Hello world!")

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.serve_forever()
I only implemented the do_GET method, because that's the default way web browsers "get" web pages. All the rest of the Handler is inherited from the BaseHTTPRequestHandler – I don't even know what required code may be in there, but I know (from the Python documentation) that there are several useful methods in there:
- the send_response method, that starts a reply to the web browser, and sends the response code that we specify. We still need to know the response codes, but fortunately they are well documented in the RFC.
- the send_header method, that sends an HTTP header to the browser. In our case, we must define the MIME content type of the data we will be sending. Since in this simple "hello world" I don't need anything sophisticated, I just send "text/plain", which means that I will be sending normal text, without any special characters or formatting.
- the end_headers method, that just sends an empty line to the browser – this means the end of the HTTP headers and the beginning of the actual content.
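To see what these three calls amount to on the wire, here is a rough sketch that assembles a comparable response by hand. The function build_response is a hypothetical helper written for this illustration; the real handler also adds headers of its own (such as Server and Date), so its actual output is longer.

```python
def build_response(code, reason, headers, body):
    """Assemble a raw HTTP/1.0 response as bytes (hypothetical helper).

    Roughly what send_response, send_header and end_headers produce,
    followed by the content written to wfile.
    """
    lines = ["HTTP/1.0 %d %s" % (code, reason)]           # status line
    lines += ["%s: %s" % (name, value) for name, value in headers]
    head = "\r\n".join(lines) + "\r\n\r\n"                # empty line ends headers
    return head.encode("ascii") + body

raw = build_response(200, "OK", [("content-type", "text/plain")],
                     b"Hello world!")
print(raw)
```

The empty line between the headers and the body is the marker that end_headers sends.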
There is also a file-like object defined in the handler, called wfile, that I can write to in order to send things to the web browser. I use it to send a "hello world" message. Directing our web browser to any address beginning with http://localhost:8080/ gives us:
Hello world!
Now we can display the pages, changing the content type is not a problem. But it would be nice to show different pages depending on the URL used. We can get that information from the path attribute:
import BaseHTTPServer

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
Hello world!
</body></html>""" % self.path)

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.serve_forever()
Nice, it displays the page title as intended, we can easily remove the slashes from beginning, and optionally also from the end, using strip. There is however a problem for non-English speakers. Try this URL: http://localhost:8080/Łączka and you will see something like this:
/%C5%81%C4%85czka Hello world!
Not exactly as intended. What is happening? Only a small set of characters is allowed to appear inside a URL, and all other characters have to be encoded in the form of their numeric codes, prefixed with %. We set our character set to utf-8, so the URL is encoded as utf-8. We just need to decode these characters. Fortunately, there is a ready function that does that in the Python standard library, in urllib.
import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = urllib.unquote(self.path.strip('/'))
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
Hello world!
</body></html>""" % page)

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.serve_forever()
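The decoding step can also be checked without starting the server. A small sketch, assuming a hypothetical helper page_name; the import is written so that it also works on Python 3, where the function lives in urllib.parse (there it returns a Unicode string, while on Python 2 it returns the UTF-8 bytes).

```python
# -*- coding: utf-8 -*-
try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote        # Python 2, as used in this tutorial

def page_name(path):
    """Turn a request path into a page name (hypothetical helper).

    Strips the slashes and decodes the %XX sequences, like the
    expression used in do_GET.
    """
    return unquote(path.strip('/'))

name = page_name('/%C5%81%C4%85czka')
# Two bytes per accented letter were encoded, so the decoded name
# is much shorter than the path it came from.
print(len('/%C5%81%C4%85czka'), len(name))
```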
This takes care of all the special characters. Let's see how this works: http://localhost:8080/a<i>c. Weird, the characters are eaten, together with the "i". Let's look into the text that our web browser got from the server: using the "view page source" option present in most modern browsers, we can see:
<html><head><title>Sample</title></head><body>
<h1>a<i>c</h1>
Hello world!
</body></html>
The <i> is there alright, so what is going on here? Wait, that "c" looks a little weird, slanted as if it was italic. What was the "<i>" in HTML for? Right, these characters are treated as HTML markup, not as content. What can we do to avoid that? The standard procedure is to encode the three special characters: "&", "<", and ">" as so-called entities. There is a list of available HTML entities, but we only need "&amp;", "&lt;" and "&gt;" (they are derived from the names "ampersand", "less than" and "greater than"). Note that we must replace the "&" first, otherwise we would break the ampersands in the other entities. Note also that if you don't escape all user-provided text in your web applications, you are opening a security hole and enabling cross-site scripting attacks (so-called XSS) and various tricks with styles.
import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
Hello world!
</body></html>""" % page)

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.serve_forever()
I made it into a separate method because it will come in handy several times. Looking at this now, I should have made it a function outside of the class, not a method, but that's not so important at the moment.
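Why the ampersand has to be replaced first is easiest to see by trying both orders. A minimal sketch, written as plain functions outside the class; escape_html_wrong_order is a hypothetical name that exists only to show the mistake.

```python
def escape_html(text):
    """Replace the three special HTML characters with entities.

    The ampersand goes first, so that the "&" inside the entities
    produced afterwards is not escaped a second time.
    """
    return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

def escape_html_wrong_order(text):
    """Same replacements with "&" last: shown only to illustrate the bug."""
    return text.replace(">", "&gt;").replace("<", "&lt;").replace("&", "&amp;")

print(escape_html("a<i>c"))              # a&lt;i&gt;c
print(escape_html_wrong_order("a<i>c"))  # a&amp;lt;i&amp;gt;c
```

With the wrong order the browser would display the literal text "&lt;" instead of the "<" character.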
The next step is to show some text for different pages. We will use a dictionary to store the page text. We don't need to save it into files or store in a database, because our server is running all the time. If it was a PHP or CGI script, then it would be restarted with every request, so we couldn't cheat like that.
import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre>
</body></html>""" % (page, self.server.pages.get(page, "empty")))

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = { "":'Hello <a href ="world">world</a>!', "hello":'world' }
server.serve_forever()
Initially I put the pages dictionary in the Handler class itself, but later decided to move it to the server object: mostly because I can initialize it easier this way, and also because I'm not sure if various more advanced implementations of the http server also only use a single instance of handler. Anyways, our web application now displays "Hello world!" with a link on the first page, "world" on the page titled "hello" and "empty" on all the others. I've put the text of the page in a <pre> block to preserve all whitespace and newlines.
It's all good, but it's not a wiki if you can't edit it. So we need an editor for our pages. I decided to put it on the same page as the rendered text – so that we don't need any special page names to indicate that we want the editor, not the page itself. Of course, special addresses will have to be introduced sooner or later if you want to have more advanced features. But I don't care about this for now, let's just have our wiki working first.
import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea>%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = { "":'Hello <a href ="world">world</a>!', "hello":'world' }
server.serve_forever()
Ok, let's see: I have added the textarea to the HTML, and now I retrieve the text of the page to a variable, because I have to repeat it twice. Let's see how it works: it displays the editor just fine, but as soon as you press "Save", you can see:
Error response
Error code 501.
Message: Unsupported method ('POST').
Error code explanation: 501 = Server does not support this operation.
Looks familiar? Of course, we only have the GET method implemented, and not the POST. We need to make a do_POST method:
import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea>%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

    def do_POST(self):
        self.do_GET()

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = { "":'Hello <a href ="world">world</a>!', "hello":'world' }
server.serve_forever()
This should do it. Yes, I am lazy. But it's almost done, we only need some code that would actually take the text that is posted and save it into our pages dictionary. We can read that text from the self.rfile file of the handler. Piece of cake.
import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea>%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

    def do_POST(self):
        text = self.rfile.read()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.server.pages[page] = text
        self.do_GET()

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = {}
server.serve_forever()
So, I read the text from the rfile, and assign it to the appropriate page name – I need to compute the page name again, because it's not known yet at this point. For now, I just copied the relevant line, but if I had to do it a third time, I would move it to a separate function, or store it in an attribute.
I also removed the initial text from the pages, as we are going to be able to edit them, at least that's the plan.
Well, but the new code doesn't work. When you hit "Save" it just keeps on waiting to load the page, until it times out. What's wrong? I kept on struggling with this for a long while. For some reason, the wiki engine keeps on waiting for the text, but never receives any. I thought that it may be trying to read too much (read() without an argument waits for the end of the stream, and the browser keeps the connection open while it waits for our reply). Let's check how much there is to read first; we can find it in the request headers:
import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea>%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

    def do_POST(self):
        length = int(self.headers.getheader('content-length'))
        text = self.rfile.read(length)
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.server.pages[page] = text
        self.do_GET()

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = {}
server.serve_forever()
Now, this gives us an interesting result: the page is erased when we try to save it, there is no text! What is happening? Then I looked into the source of the page, and noticed the error: the textarea tag has no name on it! Form fields without a name are not passed in the form data, that's why we get no content! Just giving a name to the textarea tag fixes it:
import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea name="text">%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

    def do_POST(self):
        length = int(self.headers.getheader('content-length'))
        text = self.rfile.read(length)
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.server.pages[page] = text
        self.do_GET()

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = {}
server.serve_forever()
Let's try and save the default "empty" text and see what comes out:
text=empty
Of course, the format of the data needs to contain the variable names! The different pieces are separated with the "&" character, and the "name=" is prepended to every one of them (with the "name" replaced with the actual form field name, of course). I was pretty tired after the previous problem, so I did something wrong. Something very wrong. I just remove the first 5 characters of the data, the part that is supposed to contain the substring "text=". When you write a real web application, you can't really depend on the exact format of the data that the web browser sends you back like that! Even if you limit the size of the form fields, even if you do some client-side validation with JavaScript, even if the form absolutely must contain certain fields, you can't rely on it – because browsers sometimes behave weirdly, because users disable JavaScript for better performance and security, but most importantly, because you can always receive forged requests from bots or from users trying to hack your site. That's why you always need to re-check the validity of the data you receive, and never use any data directly in your code unquoted – be it an SQL query, binary files or HTML output. Well, at least we can escape the HTML with our ready escaping function.
import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea name="text">%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

    def do_POST(self):
        length = int(self.headers.getheader('content-length'))
        text = self.rfile.read(length)
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.server.pages[page] = self.escape_html(text[5:])
        self.do_GET()

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = {}
server.serve_forever()
The output still doesn't look right. Saving text "Hello world!" gives us:
Hello+world%21
We need to decode these quoted characters, in a similar way to what we did with URLs, with an additional twist: all the spaces are converted to "+" characters, so we have to decode them too. There is a ready function for this in urllib:
import BaseHTTPServer, urllib

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def escape_html(self, text):
        return text.replace("&", "&amp;").replace(">", "&gt;").replace("<", "&lt;")

    def do_GET(self):
        self.send_response(200)
        self.send_header("content-type", "text/html; charset=utf-8")
        self.end_headers()
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        text = self.server.pages.get(page, "empty")
        self.wfile.write("""<html><head><title>Sample</title></head><body>
<h1>%s</h1>
<pre>%s</pre><form action="" method="POST">
<p><textarea name="text">%s</textarea>
<input type="submit" value="Save"></p></form>
</body></html>""" % (page, text, text))

    def do_POST(self):
        length = int(self.headers.getheader('content-length'))
        text = self.rfile.read(length)
        page = self.escape_html(urllib.unquote(self.path.strip('/')))
        self.server.pages[page] = self.escape_html(urllib.unquote_plus(text[5:]))
        self.do_GET()

server = BaseHTTPServer.HTTPServer(("127.0.0.1", 8080), Handler)
server.pages = {}
server.serve_forever()
Now it looks good! The next step is adding links.
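For comparison, the "skip the first 5 characters" shortcut could be replaced by the standard library's form parser, which handles the field names, the "&" separators, the "+" characters and the %XX sequences in one go. This is a sketch and not the code of this wiki: read_form_field is a hypothetical helper, the request body is faked with an in-memory file, and the behaviour was only considered for Python 3 (on Python 2 the function is imported from urlparse instead).

```python
import io
try:
    from urllib.parse import parse_qs  # Python 3
except ImportError:
    from urlparse import parse_qs      # Python 2, as used in this tutorial

def read_form_field(rfile, content_length, name, default=""):
    """Read the form data and return one decoded field (hypothetical helper).

    Reads exactly content_length bytes, because a plain read() would
    wait for the end of the stream, then picks the named field.
    """
    body = rfile.read(content_length) if content_length else b""
    fields = parse_qs(body.decode("ascii"), keep_blank_values=True)
    return fields.get(name, [default])[0]

data = b"text=Hello+world%21&other=1"
fake_rfile = io.BytesIO(data)  # stands in for self.rfile
print(read_form_field(fake_rfile, len(data), "text"))
```

A request that lacks the "text" field simply yields the default value instead of silently losing its first five characters.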
To be continued…