The URL Class → Explore with me!

Introduction to URL Class

The java.net.URL class is an abstraction of a Uniform Resource Locator such as http://www.lolcats.com/ or ftp://ftp.redhat.com/pub/. It extends java.lang.Object, and it is a final class that cannot be subclassed. Rather than relying on inheritance to configure instances for different kinds of URLs, it uses the strategy design pattern. Protocol handlers are the strategies, and the URL class itself forms the context through which the different strategies are selected. Although storing a URL as a string would be trivial, it is helpful to think of URLs as objects with fields that include the scheme (a.k.a. the protocol), hostname, port, path, query string, and fragment identifier (a.k.a. the ref), each of which may be set independently. Indeed, this is almost exactly how the java.net.URL class is organized, though the details vary a little between different versions of Java. URLs are immutable. After a URL object has been constructed, its fields do not change. This has the side effect of making them thread safe.

Creating new URLs

Unlike the InetAddress objects in previous topic, you can construct instances of java.net.URL. The constructors differ in the information they require. We can create the object of URL in either of 4 ways.

public URL(String url) throws MalformedURLException

public URL(String protocol, String hostname, String file) throws MalformedURLException

public URL(String protocol, String host, int port, String file) throws MalformedURLException

public URL(URL base, String relative) throws MalformedURLException

Which constructor you use depends on the information you have and the form it’s in. All these constructors throw a MalformedURLException if you try to create a URL for an unsupported protocol or if the URL is syntactically incorrect. Exactly which protocols are supported is implementation dependent. The only protocols that have been available in all virtual machines are http and file, and the latter is notoriously flaky. Today, Java also supports the https, jar, and ftp protocols. Some virtual machines support mailto and gopher as well as some custom protocols like doc, netdoc, systemresource, and verbatim used internally by Java.

Important

If the protocol you need isn’t supported by a particular VM, you maybe able to install a protocol handler for that scheme to enable the URL class to speak that protocol. In practice, this is way more trouble than it’s worth. You’re better off using a library that exposes a custom API just for that protocol

Other than verifying that it recognizes the URL scheme, Java does not check the correctness of the URLs it constructs. The programmer is responsible for making sure that URLs created are valid. For instance, Java does not check that the hostname in an HTTP URL does not contain spaces or that the query string is x-www-form-URL-encoded. It does not check that a mailto URL actually contains an email address. You can create URLs for hosts that don’t exist and for hosts that do exist but that you won’t be allowed to connect to.

Constructing a URL from a string

The simplest URL constructor just takes an absolute URL in string form as its single argument:

public URL(String url) throws MalformedURLException

Like all constructors, this may only be called after the new operator, and like all URL constructors, it can throw a MalformedURLException. The following code constructs a URL object from a String, catching the exception that might be thrown:

try {
    URL u = new URL("http://www.audubon.org/");
} catch (MalformedURLException ex) {
    System.err.println(ex);
}

The nonsupport of RMI and JDBC is actually a little deceptive; in fact, the JDK does support these protocols. However, that support is through various parts of the java.rmi and java.sql packages, respectively. These protocols are not accessible through the URL class like the other supported protocols (although I have no idea why Sun chose to wrap up RMI and JDBC parameters in URL clothing if it wasn’t intending to interface with these via Java’s quite sophisticated mechanism for handling URLs).

Other Java 7 virtual machines will show similar results. VMs that are not derived from the Oracle codebase may vary somewhat in which protocols they support. For example, Android’s Dalvik VM only supports the required http, https, file, ftp, and jar protocols.

Constructing a URL from its component part

You can also build a URL by specifying the protocol, the hostname, and the file:

public URL(String protocol, String hostname, String file) throws MalformedURLException

This constructor sets the port to -1 so the default port for the protocol will be used. The file argument should begin with a slash and include a path, a filename, and optionally a fragment identifier. Forgetting the initial slash is a common mistake, and one that is not easy to spot. Like all URL constructors, it can throw a MalformedURLException. For example:

try {
    URL u = new URL("http", "www.eff.org", "/blueribbon.html#intro");
} catch (MalformedURLException ex) {
    throw new RuntimeException("shouldn't happen; all VMs recognize http");
}

This creates a URL object that points to http://www.eff.org/blueribbon.html#intro, using the default port for the HTTP protocol (port 80). The file specification includes a reference to a named anchor. The code catches the exception that would be thrown if the virtual machine did not support the HTTP protocol. However, this shouldn’t happen in practice.

For the rare occasions when the default port isn’t correct, the next constructor lets you specify the port explicitly as an int. The other arguments are the same. For example, this code fragment creates a URL object that points to http://fourier.dur.ac.uk:8000/~dma3mjh/jsci/, specifying port 8000 explicitly:

try {
    URL u = new URL("http", "fourier.dur.ac.uk", 8000, "/~dma3mjh/jsci/");
} catch (MalformedURLException ex) {
    throw new RuntimeException("shouldn't happen; all VMs recognize http");
}

Constructing relative URLs

This constructor builds an absolute URL from a relative URL and a base URL:

public URL(URL base, String relative) throws MalformedURLException

For instance, you may be parsing an HTML document at http://www.ibiblio.org/javafaq/index.html and encounter a link to a file called mailinglists.html with no further qualifying information. In this case, you use the URL to the document that contains the link to provide the missing information. The constructor computes the new URL as http://www.ibiblio.org/javafaq/mailinglists.html. For example:

try {
    URL u1 = new URL("http://www.ibiblio.org/javafaq/index.html");
    URL u2 = new URL (u1, "mailinglists.html");
} catch (MalformedURLException ex) {
    System.err.println(ex);
}

The filename is removed from the path of u1 and the new filename mailinglists.html is appended to make u2. This constructor is particularly useful when you want to loop through a list of files that are all in the same directory. You can create a URL for the first file and then use this initial URL to create URL objects for the other files by substituting their filenames.

Other sources of URL object

Besides the constructors discussed here, a number of other methods in the Java class library return URL objects. In applets, getDocumentBase() returns the URL of the page that contains the applet and getCodeBase() returns the URL of the applet .class file.

The java.io.File class has a toURL() method that returns a file URL matching the given file. The exact format of the URL returned by this method is platform dependent. For example, on Windows it may return something like file:/D:/JAVA/JNP4/05/ToURLTest.java. On Linux and other Unixes, you’re likely to see file:/home/elharo/books/JNP4/05/ToURLTest.java. In practice, file URLs are heavily platform and program dependent. Java file URLs often cannot be interchanged with the URLs used by web browsers and other programs, or even with Java programs running on different platforms.

Class loaders are used not only to load classes but also to load resources such as images and audio files. The static ClassLoader.getSystemResource(String name) method returns a URL from which a single resource can be read. The ClassLoader.getSystem Resources(String name) method returns an Enumeration containing a list of URLs from which the named resource can be read. And finally, the instance method getResource(String name) searches the path used by the referenced class loader for a URL to the named resource. The URLs returned by these methods may be file URLs, HTTP URLs, or some other scheme. The full path of the resource is a package qualified Java name with slashes instead of periods such as /com/macfaq/sounds/swale.au or com/macfaq/images/headshot.jpg. The Java virtual machine will attempt to find the requested resource in the classpath, potentially inside a JAR archive.

There are a few other methods that return URL objects here and there throughout the class library, but most are simple getter methods that return a URL you probably already know because you used it to construct the object in the first place; for instance, the getPage() method of javax.swing.JEditorPane and the getURL() method of java.net.URLConnection.

Retrieving data from URL

Naked URLs aren’t very exciting. What’s interesting is the data contained in the documents they point to. The URL class has several methods that retrieve data from a URL:

public InputStream openStream() throws IOException

public URLConnection openConnection() throws IOException

public URLConnection openConnection(Proxy proxy) throws IOException

public Object getContent() throws IOException

public Object getContent(Class[] classes) throws IOException

The most basic and most commonly used of these methods is openStream(), which returns an InputStream from which you can read the data. If you need more control over the download process, call openConnection() instead, which gives you a URLConnection which you can configure, and then get an InputStream from it. Finally, you can ask the URL for its content with getContent() which may give you a more complete object such as String or an Image. Then again, it may just give you an InputStream anyway.

public final InputStream openStream() throws IOException

The openStream() method connects to the resource referenced by the URL, performs any necessary handshaking between the client and the server, and returns an Input Stream from which data can be read. The data you get from this InputStream is the raw (i.e., uninterpreted) content the URL references: ASCII if you’re reading an ASCII text file, raw HTML if you’re reading an HTML file, binary image data if you’re reading an image file, and so forth. It does not include any of the HTTP headers or any other protocol-related information. You can read from this InputStream as you would read from any other InputStream. For example:

try {
    URL u = new URL("http://www.lolcats.com");
    InputStream in = u.openStream();
    int c;
    while ((c = in.read()) != -1) System.out.write(c);
    in.close();
} catch (IOException ex) {
    System.err.println(ex);
}

The preceding code fragment catches an IOException, which also catches the MalformedURLException that the URL constructor can throw, since MalformedURLException subclasses IOException. As with most network streams, reliably closing the stream takes a bit of effort. In Java 6 and earlier, we use the dispose pattern: declare the stream variable outside the try block, set it to null, and then close it in the finally block if it’s not null. For example:

InputStream in = null
try {
    URL u = new URL("http://www.lolcats.com");
    in = u.openStream();
    int c;
    while ((c = in.read()) != -1) System.out.write(c);
} catch (IOException ex) {
    System.err.println(ex);
} finally {
    try {
        if (in != null) {
            in.close();
        }
    } catch (IOException ex) {
        // ignore
    }
}

Java 7 makes this somewhat cleaner by using a nested try-with-resources statement:

try {
    URL u = new URL("http://www.lolcats.com");
    try (InputStream in = u.openStream()) {
        int c;
        while ((c = in.read()) != -1) System.out.write(c);
    }
} catch (IOException ex) {
    System.err.println(ex);
}

Splitting a URL into pieces

URLs are composed of five pieces:

The scheme, also known as the protocol
The authority
The path
The fragment identifier, also known as the section or ref
The query string

For example, in the URL http://www.ibiblio.org/javafaq/books/jnp/index.html?isbn=1565922069#toc, the scheme is http, the authority is www.ibiblio.org, the path is /javafaq/books/jnp/index.html, the fragment identifier is toc, and the query string is isbn=1565922069. However, not all URLs have all these pieces. For instance, the URL http://www.faqs.org/rfcs/rfc3986.html has a scheme, an authority, and a path, but no fragment identifier or query string.

The authority may further be divided into the user info, the host, and the port. For example, in the URL http://admin@www.blackstar.com:8080/, the authority is admin@www.blackstar.com:8080. This has the user info admin, the host www.blackstar.com, and the port 8080.

Read-only access to these parts of a URL is provided by nine public methods: getFile(), getHost(), getPort(), getProtocol(), getRef(), getQuery(), getPath(), getUserInfo(), and getAuthority().

public String getProtocol()

The getProtocol() method returns a String containing the scheme of the URL (e.g., “http”, “https”, or “file”). For example, this code fragment prints https:

URL u = new URL("https://xkcd.com/727/");
System.out.println(u.getProtocol());

public String getHost()

The getHost() method returns a String containing the hostname of the URL. For example, this code fragment prints xkcd.com:

URL u = new URL("https://xkcd.com/727/");
System.out.println(u.getHost());

public int getPort()

The getPort() method returns the port number specified in the URL as an int. If no port was specified in the URL, getPort() returns -1 to signify that the URL does not specify the port explicitly, and will use the default port for the protocol. For example, if the URL is http://www.userfriendly.org/, getPort() returns -1; if the URL is http://www.userfriendly.org:80/, getPort() returns 80. The following code prints -1 for the port number because it isn’t specified in the URL:

URL u = new URL("http://www.ncsa.illinois.edu/AboutUs/");
System.out.println("The port part of " + u + " is " + u.getPort());

public int getDefaultPort()

The getDefaultPort() method returns the default port used for this URL’s protocol when none is specified in the URL. If no default port is defined for the protocol, then getDefaultPort() returns -1. For example, if the URL is http://www.userfriendly.org/, getDefaultPort() returns 80; if the URL is ftp://ftp.userfriendly.org:8000/, getDefaultPort() returns 21.

public String getFile()

The getFile() method returns a String that contains the path portion of a URL; remember that Java does not break a URL into separate path and file parts. Everything from the first slash (/) after the hostname until the character preceding the # sign that begins a fragment identifier is considered to be part of the file. For example:

URL page = this.getDocumentBase();
System.out.println("This page's path is " + page.getFile());

If the URL does not have a file part, Java sets the file to the empty string.

public String getPath()

The getPath() method is a near synonym for getFile(); that is, it returns a String containing the path and file portion of a URL. However, unlike getFile(), it does not include the query string in the String it returns, just the path.

public String getRef()

The getRef() method returns the fragment identifier part of the URL. If the URL doesn’t have a fragment identifier, the method returns null. In the following code, getRef() returns the string xtocid1902914:

URL u = new URL("http://www.ibiblio.org/javafaq/javafaq.html#xtocid1902914");
System.out.println("The fragment ID of " + u + " is " + u.getRef());

public String getQuery()

The getQuery() method returns the query string of the URL. If the URL doesn’t have a query string, the method returns null. In the following code, getQuery() returns the string category=Piano:

URL u = new URL("http://www.ibiblio.org/nywc/compositions.phtml?category=Piano");
System.out.println("The query string of " + u + " is " + u.getQuery());

public String getUserInfo()

Some URLs include usernames and occasionally even password information. This information comes after the scheme and before the host; an @ symbol delimits it. For instance, in the URL http://elharo@java.oreilly.com/, the user info is elharo. Some URLs also include passwords in the user info. For instance, in the URL ftp://mp3:secret@ftp.example.com/c%3a/stuff/mp3/, the user info is mp3:secret. However, most of the time, including a password in a URL is a security risk. If the URL doesn’t have any user info, getUserInfo() returns null. Mailto URLs may not behave like you expect. In a URL like mailto:elharo@ibiblio.org, “elharo@ibiblio.org” is the path, not the user info and the host. That’s because the URL specifies the remote recipient of the message rather than the username and host that’s sending the message.

public String getAuthority()

Between the scheme and the path of a URL, you’ll find the authority. This part of the URI indicates the authority that resolves the resource. In the most general case, the authority includes the user info, the host, and the port. For example, in the URL ftp://mp3:mp3@138.247.121.61:21000/c%3a/, the authority is mp3:mp3@138.247.121.61:21000, the user info is mp3:mp3, the host is 138.247.121.61, and the port is 21000. However, not all URLs have all parts. For instance, in the URL http://conferences.oreilly.com/java/speakers/, the authority is simply the hostname conferences.oreilly.com. The getAuthority() method returns the authority as it exists in the URL, with or without the user info and port.

Equality and Comparison

The URL class contains the usual equals() and hashCode() methods. These behave almost as you’d expect. Two URLs are considered equal if and only if both URLs point to the same resource on the same host, port, and path, with the same fragment identifier and query string. However there is one surprise here. The equals() method actually tries to resolve the host with DNS so that, for example, it can tell that http://www.ibiblio.org/ and http://ibiblio.org/ are the same.

This means that equals() on a URL is potentially a blocking I/O operation! For this reason, you should avoid storing URLs in data structure that depend on equals() such as java.util.HashMap. Prefer
java.net.URI for this, and convert back and forth from URIs to URLs when necessary.

On the other hand, equals() does not go so far as to actually compare the resources identified by two URLs. For example, http://www.oreilly.com/ is not equal to http://www.oreilly.com/index.html; and http://www.oreilly.com:80 is not equal to http://www.oreilly.com/.

Conversion

URL has three methods that convert an instance to another form: toString(), toExternalForm(), and toURI().

Like all good classes, java.net.URL has a toString() method. The String produced by toString() is always an absolute URL, such as http://www.cafeaulait.org/javatutorial.html. It’s uncommon to call toString() explicitly. Print statements call to String() implicitly. Outside of print statements, it’s more proper to use toExternalForm() instead:

public String toExternalForm()

The toExternalForm() method converts a URL object to a string that can be used in an HTML link or a web browser’s Open URL dialog. The toExternalForm() method returns a human-readable String representing the URL. It is identical to the toString() method. In fact, all the toString() method does is return toExternalForm(). Finally, the toURI() method converts a URL object to an equivalent URI object:

public URI toURI() throws URISyntaxException

We’ll take up the URI class shortly. In the meantime, the main thing you need to know is that the URI class provides much more accurate, specification-conformant behavior than the URL class. For operations like absolutization and encoding, you should prefer the URI class where you have the option. You should also prefer the URI class if you need to store URLs in a hashtable or other data structure, since its equals() method is not blocking. The URL class should be used primarily when you want to download content from a server.

Network Programming

The URL Class

Introduction to URL Class

Creating new URLs

Important

Constructing a URL from a string

Constructing a URL from its component part

Constructing relative URLs

Other sources of URL object

Retrieving data from URL

public final InputStream openStream() throws IOException

Splitting a URL into pieces

public String getProtocol()

public String getHost()

public int getPort()

public int getDefaultPort()

public String getFile()

public String getPath()

public String getRef()

public String getQuery()

public String getUserInfo()

public String getAuthority()

Equality and Comparison

Conversion

You may like

Leave a Reply Cancel reply

How to whitelist website on AdBlocker?

Network Programming

Introduction to URL Class

Creating new URLs

Important

Constructing a URL from a string

Constructing a URL from its component part

Constructing relative URLs

Other sources of URL object

Retrieving data from URL

public final InputStream openStream() throws IOException

Splitting a URL into pieces

public String getProtocol()

public String getHost()

public int getPort()

public int getDefaultPort()

public String getFile()

public String getPath()

public String getRef()

public String getQuery()

public String getUserInfo()

public String getAuthority()

Equality and Comparison

Conversion

You may like

How can we help?

Leave a Reply Cancel reply

How to whitelist website on AdBlocker?